<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Artificial Fintelligence]]></title><description><![CDATA[I write detailed articles about the frontiers of AI research.

Read by over 5000 researchers at OpenAI, DeepMind, Midjourney, Google, Stanford, Berkeley, etc.]]></description><link>https://www.artfintel.com</link><image><url>https://substackcdn.com/image/fetch/$s_!JwWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c3a757-b2e7-4104-9f17-1e79c01d013c_1024x1024.png</url><title>Artificial Fintelligence</title><link>https://www.artfintel.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 06:15:58 GMT</lastBuildDate><atom:link href="https://www.artfintel.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Finbarr Timbers]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[finbarrtimbers@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[finbarrtimbers@substack.com]]></itunes:email><itunes:name><![CDATA[Finbarr Timbers]]></itunes:name></itunes:owner><itunes:author><![CDATA[Finbarr Timbers]]></itunes:author><googleplay:owner><![CDATA[finbarrtimbers@substack.com]]></googleplay:owner><googleplay:email><![CDATA[finbarrtimbers@substack.com]]></googleplay:email><googleplay:author><![CDATA[Finbarr Timbers]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Bitter Lesson]]></title><description><![CDATA[Far too many people misunderstand the bitter lesson]]></description><link>https://www.artfintel.com/p/the-bitter-lesson</link><guid isPermaLink="false">https://www.artfintel.com/p/the-bitter-lesson</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 26 Jun 2025 15:28:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JwWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c3a757-b2e7-4104-9f17-1e79c01d013c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a> is an excellent essay which is overwhelmingly misunderstood. The point of the bitter lesson is that, over time, methods which scale with compute will outperform methods that do not.</p><p>It is not:</p><ul><li><p>The idea that we should never incorporate human knowledge</p></li><li><p>The idea that deep learning and scale are all we need (Rich is actually relatively skeptical of deep learning)</p></li></ul><p>The entire point of the essay is that, in the last 5 decades, we have seen massive increases in the amount of compute available to us as an industry, and we expect to continue to see <a href="https://openai.com/index/announcing-the-stargate-project/">massive increases</a> in the amount of compute available to AI research. Methods which take advantage of compute will benefit, and those that do not will suffer.</p><p>The reason the lesson is bitter is that it is often much easier and quicker to get results by incorporating human knowledge.</p><p>If you&#8217;re training an autocomplete system in 1995, you&#8217;re probably not going to get very far with next token prediction; instead, hand-coded or statistically generated rules will do better. In 2005, N-gram models are the state of the art. It isn&#8217;t until the mid-2010s that we start to see deep learning dominate NLP, and not until the late 2010s that self-supervised learning becomes dominant. At each point along the way, incorporating human knowledge has been advantageous, and has been a way to get an edge over your competition. But in the long term, it&#8217;s a dead end. Methods which take advantage of more compute outperform over a sufficiently long time frame, and compute is the only input which we can expect to increase by several orders of magnitude. Much as I wish it were otherwise, it&#8217;s unlikely that we&#8217;ll see 1000x the number of tokens we have now; for compute, that increase is almost certain.</p>
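<p>To make the contrast concrete, here is a minimal sketch of the kind of count-based N-gram predictor that dominated that era (a bigram model over whitespace tokens; the toy corpus and function names are my own, purely for illustration). It memorizes co-occurrence counts rather than learning a representation, which is exactly why it plateaus no matter how much compute you throw at it:</p><pre><code>from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, how often each successor follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Autocomplete by returning the most frequent successor."""
    followers = counts.get(token)
    return followers.most_common(1)[0][0] if followers else None

corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat"
</code></pre>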
<p>The canonical example is computer chess. Before <a href="https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)">Deep Blue</a>, expert systems were largely used. Deep Blue showed that leveraging compute to perform extensive searches against a hand-coded value function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> performed extremely well. Deep Blue was a massive win for the &#8220;scale compute&#8221;/computer search crowd, as it was based much more on scale than on human heuristics, but it still required an evaluation function with 8000 custom chess features created by human experts, weighted using hand-selected weights. One measure of the generality of a system is how easy it would be to extend it to a different scenario. Extending Deep Blue to work in Go would have been extremely challenging, as one would need to come up with a proper evaluation function by creating another 8000 custom Go features.</p><p>Computer Go is another example where human knowledge fell short. <a href="https://arxiv.org/abs/1712.01815">AlphaGo Zero</a> was evaluated against the then state-of-the-art Go bots, which included <a href="https://pasky.or.cz/go/pachi-tr.pdf">Pachi</a>, <a href="https://www.moderndescartes.com/essays/gnugo_to_agz/">GnuGo</a>, and <a href="https://en.wikipedia.org/wiki/Crazy_Stone_(software)">CrazyStone</a>. Pachi and CrazyStone did MCTS with heuristic value functions, and GnuGo was an expert system, with a hand-crafted decision tree to select moves. They were good at the time! But they were all ultimately dead ends. As Rich states in the article:</p><blockquote><p>The bitter lesson is based on the historical observations that </p><p>1) AI researchers have often tried to build knowledge into their agents, </p><p>2) this always helps in the short term, and is personally satisfying to the researcher, but </p><p>3) in the long run it plateaus and even inhibits further progress, and </p><p>4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. </p><p>The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.</p></blockquote><p>If you look at <a href="https://github.com/lungzeeyim/GNUgo">GnuGo&#8217;s code</a>, it represents a lot of hard work by a lot of people, yet it was dramatically worse than what was possible. 
What&#8217;s surprising is that, while GnuGo <a href="https://www.gnu.org/software/gnugo/devel.html">began in 1989</a>, releases continued until 2009, so the authors were undoubtedly aware of Deep Blue and the stunning victory of scaled search, yet they continued to push forward with their expert system. Brian Lee, a former Google Brain researcher who <a href="https://github.com/tensorflow/MiniGo">replicated AlphaGo within Brain</a>, offers a <a href="https://www.moderndescartes.com/essays/gnugo_to_agz/">compelling explanation</a> for why:</p><blockquote><p>I offer another point: that these stages [of the Bitter Lesson] happen over the span of a decade or so. Over this decade, PhDs are minted, career identities built, promotion criteria set, culture defined, and org charts annealed. Much in the way that science progresses one funeral at a time, progress on difficult problems progresses one organization shutdown at a time.</p></blockquote><p>Consider another scenario. You work at an LLM lab, and have to make your benchmark numbers bigger than your competition&#8217;s or you get fired. You face the immediate temptation to incorporate human knowledge, which in this case might be specialized datasets for a specific benchmark. </p><p>A better approach would be to make the model <em>generally</em> stronger. Using &#8220;does this scale with compute?&#8221; as a filter is a strong bet to make, as Jensen Huang is doing his best to give you multiple orders of magnitude more FLOPS. Methods like test-time compute, synthetic data, or MoE models are great examples. But the problem with this approach, which when I write it down seems obvious, is that in the moment, it feels <em>indulgent</em>. We <em>don&#8217;t have time</em> for proper science, we have to beat the other labs on LiveCodeBench. That is the bitter lesson: DeepSeek focuses on general improvements, gets them working, scales them to 3.8e25 FLOPS, and is SOTA. </p><h2>Articles I&#8217;m reading right now:</h2><ul><li><p><a href="https://substack.com/home/post/p-166556899">What comes next, by Nathan Lambert (Interconnects)</a>, discussing, among other aspects, how excellent o3 is.</p></li><li><p><a href="https://substack.com/home/post/p-158907079?source=queue">Undertrained tokens in R1</a>, by Sander Land.</p></li><li><p>The <a href="https://x.com/finbarrtimbers/status/1938115342427165035">Deep Blue paper</a>, which is worth reading. </p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Deep Blue is fascinating for a variety of reasons, including the fact that they had custom &#8220;chess chips&#8221; made to encode the evaluation function in hardware. 
</p></div></div>]]></content:encoded></item><item><title><![CDATA[Reinforcement learning and general intelligence]]></title><description><![CDATA[Epsilon random is not enough]]></description><link>https://www.artfintel.com/p/reinforcement-learning-and-general</link><guid isPermaLink="false">https://www.artfintel.com/p/reinforcement-learning-and-general</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 05 Jun 2025 15:31:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JwWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c3a757-b2e7-4104-9f17-1e79c01d013c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A disclaimer: nothing that I say here represents any organization other than Artificial Fintelligence. These are my views, and mine alone, although I hope that you share them after reading. </em></p><p>Frontier labs are spending, in the aggregate, hundreds of millions of dollars annually on data acquisition, leading to a number of startups selling data to them (Mercor, Scale, Surge, etc). The novel data, combined with reinforcement learning (RL) techniques, represents the clearest avenue to improvement, and to AGI. I am firmly convinced that scaling up RL techniques will lead to excellent products, and, eventually, AGI. A primary source of improvement over the last decade has been <em>scale</em>, as the industry has discovered one method after another that allows us to convert money into intelligence. First, bigger models. Then, more data (thereby making Alexandr Wang very rich). And now, RL.</p><p>RL is the subfield of machine learning that studies algorithms which <em>discover new knowledge.</em> Reinforcement learning agents take actions in environments to systematically discover the optimal strategy (called a <em>policy</em>). An example environment is Atari: you have an environment (the Atari game) where the agent can take actions (moving in different directions, pressing the &#8220;fire&#8221; button) and the agent receives a scalar reward signal that it wants to maximize (the score). Without being given any data on how to play Atari games, RL algorithms are able to discover policies which get optimal scores in most Atari games.</p><p>The key problem in RL is the exploration/exploitation tradeoff. Each time the agent is asked to choose an action, it has to decide between the action which it currently thinks is best (&#8220;exploiting&#8221;) and trying a new action which might be better (&#8220;exploring&#8221;). This is an extremely difficult decision to get right. Consider a complicated game like StarCraft or Dota. For any individual situation that the agent is in, how can we know what the optimal action is? It&#8217;s only after making an entire game&#8217;s worth of decisions that we are able to know if our strategy is sound, and it is only after playing many games that we are able to conclude how good we are in comparison to other players.</p>
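<p>The &#8220;epsilon random&#8221; exploration alluded to in the subtitle is the simplest way to make this decision: with some small probability, act randomly; otherwise, take the best-known action. A minimal sketch (the toy action values and names are my own, purely for illustration):</p><pre><code>import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore; otherwise exploit the best-known action."""
    if random.random() &lt; epsilon:
        return random.randrange(len(q_values))  # explore: any action at random
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q = [0.1, 0.5, 0.3]  # estimated value of each of three actions
action = epsilon_greedy(q)  # usually 1, occasionally a random action
</code></pre>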
<p>Large language models help significantly here, as they are much, much more sample efficient because they have incredibly strong priors. By encoding a significant fraction of human knowledge, the models are able to behave well in a variety of environments before they&#8217;ve received any reward signal. </p><p>When it comes to language modelling, most use of RL to date has been for RLHF, which is mostly used for behaviour modification. As there is (typically) no live data involved, RLHF isn&#8217;t &#8220;real&#8221; RL and does not face the exploration/exploitation tradeoff, nor does it allow for the discovery of new knowledge. </p><p>Knowledge discovery is the main unsolved problem in modern machine learning. While we&#8217;ve become proficient at supervised learning, we haven&#8217;t yet cracked the code on how to systematically discover new knowledge, especially superhuman knowledge. For <a href="https://www.cs.cmu.edu/~sandholm/cs15-888F21/lecture16_AlphaStar%20-%20CMU%202021.pdf">AlphaStar</a>, for instance, the team spent a lot of compute discovering new policies, as it is an extraordinarily hard problem to discover good strategies in StarCraft without prior knowledge.</p><p>Therein lies the rub: RL is simultaneously the most promising and most challenging approach we have. DeepMind invested billions of dollars in RL research with little commercial success to show for it (the <a href="https://www.nobelprize.org/prizes/chemistry/2024/press-release/">Nobel prize</a>, for instance, was for AlphaFold, which didn&#8217;t use RL). While RL is often the only solution for certain hard problems, it is <em>notoriously</em> difficult to implement effectively. Consider a game with discrete turns, like Chess or Go. In Go, you have on average 250 different choices at each turn, and the game lasts for 150 moves. Consequently, the game tree has approximately 250^150 nodes, or ~10^360. If you <a href="https://paperswithcode.com/method/epsilon-greedy-exploration">search randomly</a> (which is how many RL algorithms explore), it is exceedingly difficult to find a reasonable trajectory in the game, which is why AlphaZero-style self-play, or an AlphaGo-style supervised learning phase, is needed. When we consider the LLM setting, in which typical vocabulary sizes are in the 10s to 100s of thousands of tokens, and sequence lengths can be in the 10s to 100s of thousands, the problem is made much worse. The result is a situation where RL is both necessary and yet should be considered a last resort. </p>
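<p>The ~10^360 figure is worth checking once; a quick back-of-the-envelope calculation (the LLM numbers below simply take one vocabulary size and one sequence length from the ranges above):</p><pre><code>import math

branching, depth = 250, 150  # average legal moves per turn, moves per game
print(round(depth * math.log10(branching)))  # 360: ~10**360 nodes in the tree

# The same estimate for an LLM "episode" is dramatically worse:
vocab, seq_len = 100_000, 10_000
print(round(seq_len * math.log10(vocab)))  # 50000: ~10**50000 possible sequences
</code></pre>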
<p>Put differently, one way to think of deep learning is that it&#8217;s all about learning a good, generalizable function approximation. In deep RL, we are approximating a value function, i.e. a function that tells us exactly how good or how bad a given state of the world is. To improve the accuracy of the value function, we need to receive data with non-trivial answers; if every trajectory receives the same reward, there is no signal to learn from. Consider a coding assistant, like Cursor&#8217;s newly released <a href="https://docs.cursor.com/background-agent">background agent</a>. One way to train the agent would be to give it a reward of 1 if it returns code which is merged into a pull request, and 0 otherwise. If you took a randomly initialized network, it would output gibberish, and would thus always receive a reward of 0. Only once you have a model that is actually good enough to sometimes be useful to users can you start getting meaningful signal and rapidly improve. </p><p>As an illustrative example, I have a friend who works at a large video game publisher doing RL research for games (think: EA, Sony, Microsoft, etc.). He consults with teams at the publisher&#8217;s studios that want to use RL. Despite being a practitioner with more than two decades of RL experience, his first response is usually to ask if they&#8217;ve tried everything else, because it&#8217;s <em>so difficult</em> to get RL to work in practical settings. </p><p>The great question with reinforcement learning and language models is whether or not we&#8217;ll see results transfer to other domains, like we have seen with next token prediction. The great boon of autoregressive language models has been that they generalize well: you can train a model to predict the next token and it learns to generate text that is useful in a number of other situations. It is absolutely not clear whether that will be the case with models trained largely with RL, as RL policies tend to be overly specialized to the exact problem they were trained on. AlphaZero notoriously had problems with catastrophic forgetting; a <a href="https://arxiv.org/abs/2004.09677">paper</a> that I wrote while at DeepMind showed that simple exploits existed which could consistently beat AlphaZero, and this has been replicated in a number of other papers. To get around this, many RL algorithms require repeatedly revisiting the training data via replay buffers, which is awkward and unwieldy.</p><p>With LLMs, this is a major problem. Setting aside RL, in the open research space, we see a lot of VLMs that are trained separately from their LLM equivalents. <a href="https://github.com/deepseek-ai/DeepSeek-VL2">DeepSeek-VL2</a> is a separate family of models from <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3">V3</a>, which is text-only, despite all the major closed-source models accepting multimodal inputs. The main reason for the separation is that, in the published literature, adding multimodal capacities to LLMs sacrifices pure-text performance. When we add RL, we should expect the problem to become much worse, and more research to be dedicated to improving the inherent tradeoffs here.</p><p>In my experience as a practitioner, RL lives or dies based on the quality of the reward signal. One of the most able RL practitioners that I know, <a href="https://www.amii.ca/people/adam-white">Adam White</a>, begins all of his RL projects by first learning to predict the reward signal, and only then trying to optimize it (first predict, then control). Systems that optimize complex, overfit reward models will struggle. Systems like the Allen Institute&#8217;s <a href="https://arxiv.org/abs/2411.15124">Tulu 3</a>, which used verifiable rewards to do RL, seem like the answer, and provide motivation for the hundreds of millions of dollars that the frontier labs are spending on acquiring data.</p>
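<p>The appeal of verifiable rewards is that the signal comes from a programmatic checker rather than a learned reward model, so there is nothing for the policy to overfit against. A minimal sketch of the idea (the answer-extraction convention and names here are my own assumptions, purely for illustration):</p><pre><code>def verifiable_reward(completion, expected_answer):
    """Return 1.0 iff the completion's final answer matches a known ground truth.

    Unlike a learned reward model, this checker cannot drift as the policy
    improves, but it only works for tasks with a checkable answer (math,
    code with unit tests, etc.).
    """
    marker = "Answer:"  # assumed output convention, not from any specific paper
    if marker not in completion:
        return 0.0
    answer = completion.split(marker)[-1].strip()
    return 1.0 if answer == expected_answer.strip() else 0.0

print(verifiable_reward("The sum works out... Answer: 42", "42"))  # 1.0
print(verifiable_reward("I am not sure.", "42"))                   # 0.0
</code></pre>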
<p>The development of AlphaGo illustrates this paradox perfectly:</p><ul><li><p>RL was essentially the only viable approach for achieving superhuman performance in Go</p></li><li><p>The project succeeded, but required enormous resources and effort</p></li><li><p>The solution existed in a &#8220;narrow passageway&#8221;: there were likely very few variations of the AlphaGo approach that would have worked, as can be seen by the struggle that others have had replicating AlphaGo&#8217;s success in other domains.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li></ul><p>We&#8217;re now facing a similar situation with language models:</p><ol><li><p>We&#8217;ve largely exhausted the easily accessible training data</p></li><li><p>We need to discover new knowledge to progress further</p></li><li><p>For superhuman knowledge in particular, we can&#8217;t rely on human supervision, by definition</p></li><li><p>RL appears to be the only framework general enough to handle this challenge</p></li></ol><p>In short, this is a call for research labs to start investing in fundamental RL research again, and in particular, to finally make progress on the exploration problem. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I actually can&#8217;t think of any successful applications of <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">MCTS</a> to real-world problems. Other than the AlphaGo/AlphaZero/MuZero line of work, it doesn&#8217;t seem to have led to anything, which 2017 Finbarr would have found extremely surprising.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How to hire ML engineers/researchers]]></title><description><![CDATA[I&#8217;m going to assume that you&#8217;ve figured out how to find candidates who appear great on paper and your only problem is figuring out which of them to hire.]]></description><link>https://www.artfintel.com/p/how-to-hire-ml-engineersresearchers</link><guid isPermaLink="false">https://www.artfintel.com/p/how-to-hire-ml-engineersresearchers</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 16 Jan 2025 21:56:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JwWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c3a757-b2e7-4104-9f17-1e79c01d013c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m going to assume that you&#8217;ve figured out how to find candidates who appear great on paper and your only problem is figuring out which of them to hire. 
Getting high-quality candidates to apply is more of a marketing/brand/sales exercise, which I don&#8217;t have that much experience with; it is a non-trivial problem in the current market, particularly if you are trying to hire anyone with more than ~3 years of experience, but it is beyond the scope of this article. Here, I&#8217;m going to discuss how you should run interview processes for ML engineers/researchers.</p><p>Before I begin, a request: I&#8217;m writing an article about human data, so if you manage/use the results of a human labelling pipeline, or use signals from your users for model training/evaluation, please <a href="https://finbarr.ca/">get in touch</a>. </p><p>When discussing roles, I&#8217;m going to use the DeepMind classification, which has three main technical roles and a common <a href="https://codingrelic.geekhold.com/2018/08/google-software-engineering-levels-and.html">experience ladder</a>:</p><ol><li><p>Software engineer (SWE), which is a standard software engineer who isn&#8217;t required to know ML or research (although it is, of course, an advantage).</p></li><li><p>Research engineer (RE), which is basically everyone who isn&#8217;t a SWE or an RS. Most companies in the LLM era that are hiring &#8220;researchers&#8221; who are expected to be able to implement their ideas in large codebases are looking for REs. Their duties could run the gamut from managing experiments, to optimizing code, to doing novel research.</p></li><li><p>Research Scientist (RS), which is someone with a PhD whose success is judged entirely on their publication record. This job is not dissimilar from being a postdoc. Some RSs spend very little time coding, and some are better coders than most REs. The key differentiating factor between an RE and an RS is that an RS typically has weaker coding skills and spends more of their time thinking about what to work on next.</p></li></ol><p>The hiring process for all of these roles is broadly similar. To hire for any of them (or any role in general!) you should be running work sample tests for the specific tasks you expect each of these candidates to be able to do, while maintaining a consistently high standard in your evaluation. The hardest part, by far, about running a good hiring process is getting buy-in from the rest of your organization to continue to run a rigorous process, and to maintain a high bar. Often you have an immediate need to hire to meet some goal (if you don&#8217;t, you probably shouldn&#8217;t be hiring), so it&#8217;s always tempting to relax the bar slightly. Don&#8217;t. If you do, you&#8217;ll wake up 18 months later with a mediocre team. </p><p>I&#8217;m going to focus mostly on interviewing candidates who fall in the RE bucket, as that&#8217;s what most organizations in the product-driven research era need. These are candidates who can implement all of their ideas and have the technical expertise to run large-scale experiments by themselves.</p><h2>Work sample tests</h2><p>You want your interviews to be as close to the job as possible. I dislike Leetcode questions for this reason. 
They have their place, as they&#8217;re generally a good way to screen for competence/conscientiousness, but they tend not to work as well for researchers/ML engineers, who spend less time preparing for Leetcode.</p><p>An approach I like is to take problems that have come up during your team&#8217;s work and turn those into tasks. One that I like is debugging a real-world ML problem. A question that I have used in the past is &#8220;I have a new idea to make our models better. I implement it. It doesn&#8217;t work. What should I do?&#8221; This is a common problem that happens at work all the time! I try something, it doesn&#8217;t work, and I grab a coworker to talk to them about it. Another variant is to take a script that works, and add a bunch of common bugs to it to see if the candidate can find them, ideally bugs that have happened as part of your work.</p><p>Another question that I like to ask is to discuss evaluation, and probe the candidate on the problems that can come up with it. There are many weird ways that evaluation can fail, most of which aren&#8217;t explicitly written about, so it&#8217;s a good screen for what a candidate has experience working on.</p><p>When asking these questions, one useful tactic is to allow long, uncomfortable silences to develop. Your general rule of thumb should be to let the candidate talk as much as possible (a good metric is the % of time the interviewer is talking, which should be as close to 0 as possible). If you ask the candidate a question, like the &#8220;new idea doesn&#8217;t work&#8221; question above, be ready to let the question hang in the air while you sit in silence until they answer. You want to 1) let the person think and 2) see how they react. </p><p>The main goal with questions like these is to get away from contrived Leetcode-like problems, which can be memorized or prepared for, and instead focus on questions which require practical experience in the role. Leetcode-style questions have value, to be clear, but I don&#8217;t think they&#8217;re as relevant for the research family of roles.</p><h2>Be careful about what you include/exclude</h2><p>Your candidates will have shocking gaps in knowledge. If you don&#8217;t test for a skill, you can&#8217;t assume candidates will have that skill. This is true even for &#8220;obvious&#8221; skills like &#8220;can use source control.&#8221; I have worked with very skilled researchers who barely know how to use Git, and have basically no experience working in a team on a shared codebase.</p><p>The corollary to this is that if there&#8217;s a skill that you think &#8220;everyone should have&#8221;, many people won&#8217;t, so if you screen for it, you will remove them from the candidate pool. Be careful as to whether or not you actually want to remove them; if a skill is not required, you are needlessly making the candidate pool weaker.</p><p>For instance, I have some friends who run a company using reinforcement learning (RL) to control industrial facilities. They are world experts in RL. I encouraged them not to screen for RL skills, but only for general ML expertise, as they are probably the best people in the world to upskill their employees in RL.</p><p>If you&#8217;re not sure what skills you want to include/exclude, particularly on the behavioural side, I would encourage you to read <a href="https://www.lesswrong.com/posts/9CcdTsgvgFJg87dmW/book-review-talent">Talent</a>, by Tyler Cowen and Daniel Gross. It&#8217;s a good overview. 
A particular skill I like to see is a track record of relentlessly doing whatever was necessary to make a project succeed, across abstraction levels. For instance, Julian Schrittwieser, the lead for AlphaZero, did everything from writing papers and coming up with research ideas to implementing the Jax training pipelines in Python and writing highly optimized C++. On the flip side, if candidates restrict themselves to only certain subsets of the project&#8212;not having a history of cleaning data, or only engaging at the idea level and not writing any code&#8212;I would view that as an anti-signal.</p><h2>Screen candidates aggressively</h2><p>A common anti-pattern that I have seen is where companies will only screen for technical skills, or, if they do run behavioural interviews, will only focus on leadership/teamwork skills. These are really important! But an area that trips up a lot of AI companies is that they will be hiring incredibly skilled ML people coming directly from academia, who do <em>not want to work on products</em>. Many PhDs, and a lot of master&#8217;s/undergrad graduates, are only familiar with academia, and value their publication record above all else, <em>including compensation</em>. When hiring general software engineers this is not typically an issue, as most software engineers want to build products that are successful and make money.</p><p>Until recently, many of the large industrial research labs (DeepMind, FAIR, MSR, etc.) were run in a manner very similar to academia, and the way to advance one&#8217;s career was to publish papers in academic journals, so many people who have spent their careers at these organizations are still immersed in the academic mindset. They have not spent <em>any of their professional lives</em> trying to improve business-related KPIs, and many have no experience orienting their work around organization-level business goals (like OKRs). For many product companies, particularly startups, this is exactly the opposite of what they need. It is a point of pride for some researchers that they work on &#8220;pure science&#8221; which has no apparent useful application (if this mindset is strange to you, there&#8217;s a famous <a href="https://www.arvindguptatoys.com/arvindgupta/mathsapology-hardy.pdf">essay by Hardy</a>, the mathematician, which explains it in detail).</p><p>My advice is to spend the first call with the candidate addressing this explicitly, perhaps saying something like &#8220;Publishing papers is not a priority for us. Do not expect that you will ever publish a paper as part of your job. You will be expected to work on research that is driven by the needs of the product/business and will not have academic freedom to pursue whatever ideas you find interesting.&#8221; It may sound harsh, but this is true at most companies, and it&#8217;s worth making it explicit upfront. </p><p>I would generally advise that other unappealing aspects of working at your company be mentioned in the first call as well. Matt Mochary has written about how important the <a href="https://docs.google.com/document/d/1iK7OMMBbmvpzwfsumKde2tbO3MFLSSDU715-Yv3oOOA/edit?tab=t.0">anti-sell</a> is. You want to give candidates an accurate understanding of what working at your company will be like; one of the worst outcomes for you is to hire someone, spend a lot of time onboarding/training them, and have them churn because they don&#8217;t actually like the job. 
Do this as early in the process as possible, ideally on the first call. </p><h2>Hiring scientists vs engineers</h2><p>Using the DeepMind RE vs RS distinction, many companies only have what DeepMind would call REs, as you need to be able to implement your research in large codebases. The main differences are that, for research scientists, coding ability is less important, the ability to choose the right problem is much more important, and you have to focus on culture fit more.</p><p>Many people who are sticklers about the &#8220;scientist&#8221; label instead of being ok with the engineer label 1) expect to be able to publish papers and 2) expect to do &#8220;pure&#8221; research that&#8217;s not driven by product needs. That&#8217;s often not acceptable at most companies, so they will be unhappy and churn 12-18 months in. Screen for this.</p><h2>Behavioural skills matter</h2><p>You have to screen for behavioural skills. Once engineers/researchers hit the senior level (using the Google scale, so L5+), soft skills are more important than hard skills, and possibly even earlier in their careers.</p><p>Mentoring matters. Feedback matters. Connecting with your teammates matters. For more junior roles, being &#8220;teachable&#8221; matters more, but as the person gets more senior, their ability to mentor and give feedback becomes more and more important.</p><p>Additionally, as a researcher, you are often dealing with ambiguity throughout the research process, so it&#8217;s important to be able to discuss your experiments with your coworkers. If someone is particularly disagreeable, this will not go well, which can make your team less productive.</p><p>If you&#8217;re hiring someone from a large company, it&#8217;s important to assess their ability to add more process in a reasonable way. A common failure among senior people from big tech who are too &#8220;big-tech minded&#8221; is that they will add too much unnecessary process, or expect to be able to grow their team quickly to match the staffing levels they&#8217;ve historically been used to.</p><h2>Keep a rigorous process</h2><p>It&#8217;s easy to think &#8220;we need someone, so let&#8217;s hire someone quick.&#8221; Don&#8217;t. Keep a high bar and encourage the rest of your team to do the same, particularly if you&#8217;re at a company paying top of market. Otherwise, you&#8217;ll wake up 12 months later with a mediocre team.</p><p>Jeff Bezos had a list of questions:</p><ol><li><p>&#8220;Will you admire this person?&#8221;</p></li><li><p>&#8220;Will this person raise the average level of effectiveness of the group they&#8217;re entering?&#8221;</p></li><li><p>&#8220;Along what dimension might this person be a superstar?&#8221;</p></li></ol><p>I think this is the right approach. You should generally try to only hire the best people, and you can get by with a surprisingly small team. 
I subscribe to the <a href="http://nat.org">Nat Friedman philosophy</a>:</p><blockquote><p>Smaller teams are better:</p><ul><li><p>Faster decisions, fewer meetings, more fun</p></li><li><p>No need to chop up work for political reasons</p></li><li><p>No room for mediocre people (can pay more, too!)</p></li><li><p>Large-scale engineering projects are more soluble in IQ than they appear</p></li><li><p>Many tech companies are 2-10x overstaffed</p></li></ul></blockquote><p>Thanks to <a href="https://x.com/morgymcg">Morgan McGuire</a>, <a href="https://x.com/banburismus_">Tom McGrath</a>, <a href="https://x.com/kgourg">Kostis Gourgoulias</a>, <a href="https://x.com/_sholtodouglas">Sholto Douglas</a>, <a href="https://x.com/ayirpelle">Priya Joseph</a>, <a href="https://x.com/surmenok">Pavel Surmenok</a>, and <a href="https://x.com/generativist">Johnny</a> for reading drafts of this. </p><p>Finally, again: I&#8217;m writing an article about human data, so if you manage/use the results of a human labelling pipeline, or use signals from your users for model training/evaluation, please <a href="https://finbarr.ca/">get in touch</a>. </p><h2>Misc resources</h2><ul><li><p>Matt Mochary gives great <a href="https://docs.google.com/document/d/18FiJbYn53fTtPmphfdCKT2TMWH-8Y2L-MLqDk-MFV4s/edit?tab=t.0">hiring advice</a> (and is generally worth reading, particularly on why <a href="https://docs.google.com/document/d/1YP3LLhStLO_dMCsyiAsRsw7VJvYV_xvTY2DR72MB97k/edit?tab=t.0">references are a waste of time</a>)</p></li><li><p>Sam Altman <a href="https://blog.samaltman.com/how-to-hire">on hiring</a></p></li><li><p><a href="https://www.lesswrong.com/posts/9CcdTsgvgFJg87dmW/book-review-talent">Talent</a>, by Tyler Cowen and Daniel Gross, is a great book.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Papers I've read this week: vision language models]]></title><description><![CDATA[They kept releasing VLMs, so I kept writing...]]></description><link>https://www.artfintel.com/p/papers-ive-read-this-week-vision</link><guid isPermaLink="false">https://www.artfintel.com/p/papers-ive-read-this-week-vision</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Mon, 28 Oct 2024 15:34:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SHKL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d08983-1413-4b64-a7d1-c15488798dce_1454x736.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been on a VLM kick lately, trying to read as many papers about vision language models as possible. This was inspired by Claude being ridiculously good at converting equation screenshots to LaTeX, which made me want to understand how LLMs can be so good at understanding pictures and doing fancy OCR. I remember using <a href="https://en.wikipedia.org/wiki/Tesseract_(software)">Tesseract</a> and <a href="https://en.wikipedia.org/wiki/ABBYY_FineReader">ABBYY FineReader</a> back in the day, and finding them slow/hard to work with. Now, with VLMs, it seems like reading text from pictures is a solved problem? 
Definitely surprised me.</p><p>In any case, I realized that I didn&#8217;t have a good understanding of how VLMs worked, so I wanted to change that. Here are the results of my efforts!</p><p>Funnily enough, two VLMs have been released since I started writing this: <a href="https://arxiv.org/abs/2410.07073">Pixtral</a> and <a href="https://arxiv.org/abs/2410.13848">DeepSeek Janus</a>, each delaying the article further.</p><p>Spoiler: it&#8217;s actually super straightforward. It turns out that using some vision encoder (typically a ViT, initialized from a good open-source model) to convert images into features, patchifying them, and concatenating the resulting sequence with the text embeddings is basically enough. There are some fancier architectures, but they don&#8217;t seem noticeably better.</p><p>I will be giving a talk as part of the <a href="https://www.youtube.com/playlist?list=PL_lsbAsL_o2DnQyMUrVTEv8EtIX0hn8KT">PyTorch Expert Exchange</a> lecture series on how batching works on modern GPUs, based on the <a href="https://www.artfintel.com/p/how-does-batching-work-on-modern">article I wrote last year</a>. Please <a href="https://www.youtube.com/watch?v=HTcnp9NEHGY&amp;utm_content=314323742&amp;utm_medium=social&amp;utm_source=twitter&amp;hss_channel=tw-776585502606721024">join me</a>!</p><p>Finally, this article is <em>long</em>. You might want to <a href="https://www.artfintel.com/papers-ive-read-this-week-vision">read it on the web</a> instead. </p><h2>The evolution of multimodal model architectures</h2><p>[<a href="https://arxiv.org/abs/2405.17927">abstract</a>]</p><p>I began my mission to understand VLMs with this survey paper from Purdue. It&#8217;s a high-level overview of multimodal architectures, grouping them into four categories:</p><ol><li><p>Type A and Type B, which combine multimodal inputs within the internal layers of the model</p></li><li><p>Type C and Type D, which combine the modalities at the input stage.</p></li></ol><p>Type A employs cross-attention, while B uses custom layers for modality fusion.</p><p>Type C uses modality-specific encoders, while D uses tokenizers to convert every mode into tokens, and then processes them together.</p><p>Examples of models which fall into the various categories:</p><ol><li><p>Type A: Flamingo, and various Flamingo-derived models</p></li><li><p>Type B: CogVLM, MoE-LLaVa</p></li><li><p>Type C: DeepSeek-VL, LLaVa, Sphinx, Emu, Qwen-VL</p></li><li><p>Type D: LaVIT, Unified-IO</p></li></ol><p>Contemporary open-source models are almost all doing Type C. Type D is somewhat common in video models (e.g. MagViT2), but most multimodal papers aren&#8217;t bothering to convert the image features into tokens, instead passing the patchified features directly into the decoder.</p><p>Of the papers that are notable to me, Type C seems dominant. I suspect that most closed models, like Reka, Claude, and GPT-4o, are doing Type C. My logic is that, in deciding between <em>deep fusion</em>, where modalities are combined within the internal layers of the model, and <em>early fusion</em>, where they&#8217;re combined at the input, the large labs will focus on the most general approach and follow the <em><a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">bitter lesson</a></em>, which states that we should learn as much of our structure as possible, rather than imposing pre-determined inductive biases.</p>
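<p>Since Type C is the pattern that won, here&#8217;s roughly what it looks like in code. This is a toy sketch in PyTorch (all dimensions, names, and the tiny stand-in encoder are my own; real models use a pretrained ViT and a pretrained decoder-only LM):</p><pre><code>import torch
import torch.nn as nn

class TypeCVLM(nn.Module):
    """Toy 'Type C' fusion: encode image patches, project to the LM's
    embedding width, and prepend them to the text token embeddings."""
    def __init__(self, vocab=32000, d_model=512, patch=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        # Stand-in for a pretrained ViT: one conv that patchifies and embeds.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.proj = nn.Linear(d_model, d_model)  # vision-to-LM adapter
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image, token_ids):
        patches = self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, D)
        vision_seq = self.proj(patches)
        text_seq = self.tok_emb(token_ids)
        fused = torch.cat([vision_seq, text_seq], dim=1)  # early fusion
        return self.decoder(fused)

model = TypeCVLM()
out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(out.shape)  # torch.Size([1, 212, 512]): 196 image tokens + 16 text tokens
</code></pre><p>The entire &#8220;multimodal&#8221; part is the <code>torch.cat</code>: once the image features are projected into the LM&#8217;s embedding space, the decoder treats them like any other prefix tokens.</p>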
<p>The paper is useful as an overview, but it does get bogged down in coming up with a detailed taxonomy of VLMs, which I think is of questionable utility. It&#8217;s a great intro if you&#8217;re unfamiliar with the space.</p><h2>Flamingo</h2><p>[<a href="https://arxiv.org/abs/2204.14198">abstract</a>]</p><p>Published in April &#8216;22, Flamingo was one of the early multimodal LMs. It focused on enabling zero-shot adaptation to novel tasks that use text and image inputs. They evaluated on 16 tasks, and reached SOTA on 6 of them despite using ~3 orders of magnitude less task-specific data (yet another win for the bitter lesson!). The architecture is interesting; they combine a pretrained &amp; frozen vision encoder with a similarly frozen &amp; pretrained language model, and only train dense cross-attention layers, along with a <a href="https://arxiv.org/abs/2103.03206">Perceiver</a> Resampler layer on top of the vision encoder. This is a much more complicated architecture than we&#8217;ll see in later models, which tend to all be decoders with minor tweaks.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!SHKL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d08983-1413-4b64-a7d1-c15488798dce_1454x736.png" width="1454" height="736" alt="" /></figure><p>The core idea is that they have a Perceiver Resampler on top of a frozen vision encoder which takes visual inputs and outputs a fixed number of tokens. These are used to condition a frozen LM (various Chinchilla models) using freshly initialized gated cross-attention layers.</p>
<figure><img src="https://substackcdn.com/image/fetch/$s_!dSuS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25653cfa-47c1-465b-a0bb-00eddd21601c_1468x794.png" width="1456" height="788" alt="" /></figure><p>The model seems to work well, but was very complex. With the benefit of hindsight, it&#8217;s interesting how unnecessary the complexity was, as none of the subsequent SOTA models used a similar architecture.</p><h2>Qwen-VL</h2><p>[<a href="https://arxiv.org/abs/2308.12966">abstract</a>]</p><p>Published in August 2023, this is when we start to see architectures/training methodologies that are very similar to the state of the art in Q3 2024. The Qwen-VL series of models is based on the Qwen LLM, with added visual capacities. They claimed significant improvements over the other open VLMs that were SOTA as of Q3 2023. The model architecture is fairly similar to what we&#8217;ll see moving forward; they use a ViT-bigG, initialized from OpenCLIP&#8217;s model, and resize images to 224x224, splitting the images into patches. They then use a vision-language adapter to compress the image features. The adapter is a single layer of cross-attention, which uses a number of learned embeddings as query vectors and the image features from the visual encoder as keys, outputting a sequence of length 256. They use 2D absolute positional encodings on the output of the cross-attention layer. They have three stages of training, during which they freeze various parts of the model.</p>
They have three stages of training during which they freeze various parts of the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jXzl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jXzl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 424w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 848w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 1272w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jXzl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png" width="1400" height="668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156629,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jXzl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 424w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 848w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 1272w, https://substackcdn.com/image/fetch/$s_!jXzl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78928bbd-665b-4fc0-a54e-aefdf598b638_1400x668.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" 
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They add special tokens (<code>&lt;img&gt;</code>, <code>&lt;/img&gt;</code>) to the sequence to denote the start/end of image content, and also train the model with bounding boxes, including them as text tokens which are tokenized in the standard way, but with two types of special tokens: <code>&lt;box&gt;, &lt;/box&gt;</code> to denote the coordinates, and <code>&lt;ref&gt;, &lt;/ref&gt;</code> to denote the text description corresponding to a given bounding box.</p><p>The model is pretrained on web-scraped image-text pairs, and then trained on high-quality, fine-grained annotation data. They have an additional supervised fine-tuning stage that they use to create a Chat model. The result is quite strong, and achieves SOTA in a variety of tasks.</p><h2>CogVLM</h2><p>[<a href="https://arxiv.org/abs/2311.03079">abstract</a>]</p><p>Published in late 2023, CogVLM uses a frozen trained language model and image encoder, and combines the two with a trainable visual expert module in the attention and FFN layers, enabling vision features without sacrificing NLP performance. It is SOTA across a number of multi-modal benchmarks. 
<h2>CogVLM</h2><p>[<a href="https://arxiv.org/abs/2311.03079">abstract</a>]</p><p>Published in late 2023, CogVLM uses a frozen, pretrained language model and image encoder, and combines the two with a trainable visual expert module in the attention and FFN layers, enabling vision features without sacrificing NLP performance. It is SOTA across a number of multi-modal benchmarks. It was likely developed contemporaneously with Qwen-VL, but it has more in common with the pre-Qwen architectures, like Flamingo.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!z2Lk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7acba5-2478-4e8a-8456-f860f1d9888e_1098x1004.png" width="1098" height="1004" alt="" /></figure></div>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They add two trainable layers to each transformer block: a MLP, and a QKV matrix, initializing them from their pretrained counterparts in the language model.</p><p>Interestingly, they assign all the image tokens a single position ID for RoPE, with the logic that visual tokens already encapsulate positional information when inputted into the ViT, and that by adding additional positional information, the query would focus more on the closer tokens, i.e. the the lower part of an image.</p><p>The authors did a lot of ablation experiments to justify the choices they made. These are great and really informative for training VLMs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9YTg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9YTg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 424w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 848w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 1272w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9YTg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png" width="1456" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9YTg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 424w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 848w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 1272w, https://substackcdn.com/image/fetch/$s_!9YTg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab1c33c2-aa5d-40e5-9df6-4e3833429df9_1568x556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>DeepSeek-VL</h2><p>[<a href="https://arxiv.org/abs/2403.05525">abstract</a>]</p><p>From March 2024, the DeepSeek take on a VLM appears to be a refinement of the Qwen-VL approach. For the visual feature module, the authors use a hybrid encoder, which has a text-aligned encoder for coarse semantic extraction at 384x384 resolution with a high-resolution encoder that operates at 1024x1024 resolution. The module represents a 1024x1024 image as 576 tokens. The high-resolution encoder is based on SAM-B, while they use SigLIP-L for the low-resolution image inputs. 
SigLIP can generally be viewed as a &#8220;better CLIP&#8221;, so this is a modernization of what Qwen-VL did.</p><p>The authors use a two-layer hybrid MLP as a vision-language adapter, with two distinct single-layer MLPs processing the high- and low-resolution features separately; these features are then concatenated together and transformed into the LLM&#8217;s input space through another MLP layer.</p><p>The authors pretrain the models beginning with an intermediate checkpoint from DeepSeek-LLM, and continue to use extensive text-only data, with 70% of the tokens seen during training coming from the DeepSeek text-only corpus. The authors keep the LLM and the vision encoders frozen while they train the vision-language adapter module, then jointly train the LLM and adapter on interleaved vision-language and text-only sequences. Finally, they finetune the entire model on chat data, unfreezing everything.</p>
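<p>Here&#8217;s a minimal sketch of that hybrid adapter, assuming PyTorch; the feature widths and activation are my guesses, not the paper&#8217;s exact values:</p><pre><code>import torch
import torch.nn as nn

class HybridAdapter(nn.Module):
    """Two single-layer MLPs (one per resolution stream), concat, then project."""
    def __init__(self, hi_dim=256, lo_dim=1024, llm_dim=4096):
        super().__init__()
        self.hi_mlp = nn.Sequential(nn.Linear(hi_dim, llm_dim), nn.GELU())
        self.lo_mlp = nn.Sequential(nn.Linear(lo_dim, llm_dim), nn.GELU())
        self.joint = nn.Linear(2 * llm_dim, llm_dim)

    def forward(self, hi_feats, lo_feats):
        # hi_feats: (batch, 576, hi_dim) from SAM-B at 1024x1024
        # lo_feats: (batch, 576, lo_dim) from SigLIP-L at 384x384
        fused = torch.cat([self.hi_mlp(hi_feats), self.lo_mlp(lo_feats)], dim=-1)
        return self.joint(fused)  # (batch, 576, llm_dim)</code></pre>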
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!r-N5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5ce7c8-c5e3-4fb5-af97-66c293eb4ac3_1428x864.png" width="1428" height="864" alt="" /></figure></div><p>The authors achieve SOTA on the majority of the benchmarks they evaluate when compared against other open-source 7B models. Unsurprisingly, the proprietary models, like GPT4 or Gemini Pro, are significantly better (and presumably significantly larger). Their model doesn&#8217;t see significant degradation on language benchmarks, which has consistently been a problem plaguing VLMs, whose performance on language-only tasks tends to degrade rapidly; I suspect that the large fraction of text-only pretraining data was sufficient to prevent this. This is a good counterexample to the frozen-language-model approach we&#8217;ve seen repeatedly.</p><h2>Chameleon</h2><p>[<a href="https://arxiv.org/abs/2405.09818">abstract</a>]</p><p>Published by Meta in May of &#8216;24, Chameleon is what I would think of as a great example of a &#8220;modern&#8221; multimodal model, which uses early fusion to treat all modalities as discrete tokens. The loss is the standard autoregressive loss, with <code>&lt;start_image&gt;</code> / <code>&lt;end_image&gt;</code> tokens being used to insert the image tokens into the sequential input.</p>
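<p>Early fusion is simple enough to sketch in a few lines; the token IDs below are hypothetical placeholders:</p><pre><code>def build_sequence(text_before, image_ids, text_after,
                   start_image_id=50001, end_image_id=50002):
    """Splice discrete image tokens into the text stream; the combined
    sequence is trained with ordinary next-token cross-entropy."""
    return (text_before
            + [start_image_id] + image_ids + [end_image_id]
            + text_after)

# e.g. a 512x512 image becomes 1024 codes from an 8192-entry codebook
seq = build_sequence([17, 54, 3], list(range(1024)), [99, 12])</code></pre>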
<p>Chameleon achieves SOTA on a number of benchmarks (visual question answering, image captioning), and is competitive with text-only models of a similar size (Mixtral 8x7B, Gemini Pro).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ArL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc1fa82-cd7d-4de0-80e5-bd27c168b6c0_1350x886.png" width="1350" height="886" alt="" /></figure></div>
<p>The authors train the model on a variety of data formats, including text-only, text-image pairs, and fully interleaved text-image documents. This is, notably, the model here that was trained with the most compute, and the only one trained from scratch. As you&#8217;d expect, performance is quite strong.</p><p>For the image tokenizer, they use <a href="https://arxiv.org/abs/2203.13131">Make-A-Scene</a>, encoding a 512x512 image into 1024 discrete tokens using a codebook of size 8192. The authors note that this tokenizer is particularly bad at reconstructing images with a large amount of text, which is to be expected given the codebook size. They use a standard BPE tokenizer for text. This makes Chameleon one of the only VLMs that actually tokenizes its images rather than passing image features directly into the decoder.</p><p>The authors ran into stability issues when training models with more than 8B parameters &amp; 1T tokens, with instabilities happening late in training. They used a Llama2 architecture, with RMSNorm, SwiGLU, and RoPE. They found that the softmax operation was leading to divergences because the different modalities had significantly different entropy, so the modalities would &#8220;compete&#8221; with each other by trying to increase their norms, which would eventually explode. This is similar to the <a href="https://arxiv.org/abs/2309.14322">logit drift problem</a> in the text-only setting. Consequently, the authors used QK-Norm and added dropout after the attention &amp; feed-forward layers, which was enough to stabilize their 7B model.
To stabilize their 34B model, they also had to reorder the norms, like so:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_0zx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F854a448e-0689-4b00-be57-076ae437d4fa_780x196.png" width="780" height="196" alt="" /></figure></div>
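<p>A minimal sketch of a transformer block with both stabilizers, assuming the Swin-style reordering shown above (normalization applied to each sublayer&#8217;s output, inside the residual branch) together with QK-Norm; for brevity, the QK normalization here is applied before the projections, whereas in practice it&#8217;s usually applied per-head to the projected queries and keys:</p><pre><code>import torch
import torch.nn as nn

class ReorderedBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)  # applied to sublayer *outputs*
        self.ffn_norm = nn.LayerNorm(dim)
        self.q_norm = nn.LayerNorm(dim)     # QK-Norm keeps attention logits bounded
        self.k_norm = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.attn(self.q_norm(x), self.k_norm(x), x)
        x = x + self.attn_norm(attn_out)    # norm moved inside the residual...
        x = x + self.ffn_norm(self.ffn(x))  # ...instead of the pre-norm placement
        return x</code></pre>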
<p>The changes were quite dramatic, which violates my intuition: I did not expect norm reordering to have such a big impact on the training loss. Then again, the choice of pre- vs. post-layer normalization has had a significant impact in other settings, so perhaps I shouldn&#8217;t be surprised.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IucC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbe6790e-4ba3-41ee-83bb-70c06ea09502_1356x542.png" width="1356" height="542" alt="" /></figure></div>
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They found that dropout wasn&#8217;t necessary with norm reordering, but QK-Norm was. This makes sense; I think QK-Norm should generally be used by default.</p><h2>PaliGemma</h2><p>[<a href="https://arxiv.org/abs/2407.07726">abstract</a>]</p><p>From July 2024, PaliGemma continues to demonstrate the superiority of <a href="https://x.com/giffmana?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">Lucas Beyer</a>&#8217;s SigLIP loss, combining a 400M SigLIP model with a 2B Gemma model into a &lt;3B VLM that is SOTA on a variety of tasks despite being (relatively) small.</p><p><a href="https://arxiv.org/abs/2303.15343">SigLIP</a>, standing for <strong>Sig</strong>moid loss for <strong>L</strong>anguage <strong>I</strong>mage <strong>P</strong>re-training, is a loss that operates solely on image-text pairs and does not require a global view of the pairs for normalization. It can be thought of as a replacement for <a href="https://arxiv.org/abs/2103.00020">CLIP</a>. 
It is defined as:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xAPw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45884dbf-7482-4bb9-8942-d44d836e98ff_696x244.png" width="696" height="244" alt="" /></figure></div>
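<p>In text form (my transcription of the loss from the SigLIP paper, where $t$ is a learned temperature and $b$ a learned bias): $$-\frac{1}{|B|} \sum_{i=1}^{|B|} \sum_{j=1}^{|B|} \log \frac{1}{1 + e^{z_{ij}(-t x_i \cdot y_j + b)}}$$</p>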
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Where $x_i$ is the normalized feature embeddings from the image datapoint, and $y_j$ is the normalized feature embeddings from the text datapoint, and $z_{ij}$ is 1 if the image and text datapoints are paired, and $-1$ otherwise (no, your browser isn&#8217;t broken&#8212; there&#8217;s no good way to do inline math with Substack, unfortunately). PaliGemma uses a SigLIP image encoder to provide image features which are concatenated with the text tokens and processed by a standard decoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HwRQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HwRQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 424w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 848w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 1272w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HwRQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png" width="1244" height="656" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1244,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HwRQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 424w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 848w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 1272w, https://substackcdn.com/image/fetch/$s_!HwRQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc32fceeb-4d35-4124-b219-d0d3ca7f9056_1244x656.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unlike Flamingo, the entire PaliGemma architecture is trained jointly on multimodal tasks, with nothing being frozen. Unlike Chameleon, the individual components are initialized from previously trained models (SigLIP 400M and Gemma 2B). It is very similar to DeepSeek-VL.</p><p>PaliGemma was SOTA on the <a href="https://github.com/tsb0601/MMVP/blob/main/imgs/teaser.png">MMVP</a> benchmark (47.3%), and did well on the rest. This is remarkable given how small and (relatively) cheap it was to train. Notably, it beat GPT4-V (38.7%) and Gemini (40.7%) on this task. 
even though those models are presumably much bigger and saw much more data during training.</p><h2>Pixtral (12B)</h2><p>[<a href="https://arxiv.org/abs/2410.07073">abstract</a>]</p><p>From October 2024, Pixtral is a 12B-parameter multimodal model, using a vision encoder trained from scratch which ingests images at their natural resolution and aspect ratio, allowing it to vary the number of tokens used for an image. It has a long context window of 128k tokens. Pixtral also has a novel RoPE-2D implementation.</p><p>It is based on Mistral Nemo 12B. They train a new vision encoder from scratch, Pixtral-ViT, a 400M ViT. It has 4 key changes vs. a typical ViT:</p><ol><li><p>They include <code>&lt;IMAGE_BREAK&gt;</code> tokens between image rows (as they scan patches in raster order), and include an <code>&lt;IMAGE_END&gt;</code> token at the end of an image sequence.</p></li><li><p>They use gating in the hidden layer of the FFN.</p></li><li><p>To efficiently process images, they use sequence packing, flattening images along the sequence dimension and concatenating them; they then use a block-diagonal mask to ensure no attention leakage happens (see the sketch after this list).</p></li><li><p>RoPE-2D. They replace the standard learned, absolute position embeddings for image patches with relative, rotary position encodings. The function is kinda complicated: $$\text{Rope2D}(x^{(i, j)}, \Theta) = M_{\Theta}^{(i, j)}x^{(i, j)},$$ where $M_{\Theta}^{(i, j)}$ is a block-diagonal matrix such that $M_{\Theta}^{(i, j)}[k: k + 2, k: k + 2]$ are the only non-zero entries, with each 2x2 block being equal to $$\begin{pmatrix}\cos l \theta_1 &amp; -\sin l \theta_1 \\ \sin l \theta_1 &amp; \cos l \theta_1 \end{pmatrix}$$ where $l = i$ if $k$ is odd, and $j$ otherwise, and $\Theta = [\theta_1 \dots \theta_{d/2}]$ is a vector of frequencies for the various dimensions of $x$, with $\theta_m$ defined following standard, 1D <a href="https://arxiv.org/abs/2104.09864">RoPE</a>.</p><p>Their implementation satisfies the &#8220;relative&#8221; property, where the inner product between two vectors depends only on their relative difference in height/width indices, which is generally agreed to be highly desirable.</p></li></ol>
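<p>Here&#8217;s a minimal sketch of the block-diagonal packing mask from point 3, with illustrative shapes:</p><pre><code>import torch

def block_diagonal_mask(image_lengths):
    """image_lengths: patch-token counts of the images packed into one sequence."""
    total = sum(image_lengths)
    allowed = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in image_lengths:
        # Patches may only attend within their own image: no leakage
        # across boundaries in the packed sequence.
        allowed[start:start + n, start:start + n] = True
        start += n
    return allowed  # True = attention permitted

mask = block_diagonal_mask([196, 64])  # two images packed into one sequence</code></pre>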
<p>The Pixtral vision features are included in the decoder via a two-layer fully connected network, using a GeLU activation on the intermediate hidden layer. Image tokens are treated identically to text tokens by the multimodal decoder, including 1D RoPE embeddings for each token; in particular, the decoder uses causal attention. It&#8217;s a complicated architecture, but it has some nice properties, particularly the variable-length, native-resolution image encoding.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!hS32!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd714cc1b-4944-42d8-aee8-c10809ec5e7a_1592x888.png" width="1456" height="812" alt="" /></figure></div>
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unfortunately, there&#8217;s no information provided about how they trained the models, and in particular, about whether or not the LLM/vision encoder were frozen during training. The model is notable for being (relatively) quite large, at 12B parameters. That&#8217;s much more than the other models discussed here.</p><h2>DeepSeek Janus</h2><p>[<a href="https://arxiv.org/abs/2410.13848v1">abstract</a>]</p><p>Well, as mentioned above, I was about to publish this article, and DeepSeek released Janus. Janus is a fascinating piece of work as, in true DeepSeek fashion, it&#8217;s an actual novel architecture. DeepSeek has two visual encoders: one for visual understanding, and one for image generation. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hc2D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hc2D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 424w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 848w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 1272w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hc2D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png" width="1422" height="830" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191024,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hc2D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 424w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 848w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 1272w, https://substackcdn.com/image/fetch/$s_!hc2D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F503ae629-cc81-4c38-8e38-46c196a6d418_1422x830.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
<p>Otherwise, the architecture is basically the same as what we&#8217;ve seen before, a typical &#8220;early fusion&#8221; model that uses a pretrained LLM to process the visual features. The visual understanding model uses a SigLIP encoder to extract visual features, which are then flattened into a 1D sequence, and they have an &#8220;understanding adaptor&#8221; which maps these image features into the input space of the LLM. For visual generation, they use a VQ tokenizer from <a href="https://arxiv.org/abs/2406.06525">LlamaGen</a> to convert the images into sequences of discrete IDs, which are then transformed into embeddings via a generation adaptor. The result is a multimodal sequence that is fed into the decoder model. They use a three-stage training process which is, again, very similar to other models (particularly, and unsurprisingly, DeepSeek-VL).</p>
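<p>To make the data flow concrete, here&#8217;s a minimal sketch of a Janus-style forward pass. This is my own reconstruction from the paper&#8217;s description, not DeepSeek&#8217;s code; all of the module names and dimensions are made up.</p><pre><code class="language-python">
import torch
import torch.nn as nn

class JanusSketch(nn.Module):
    """Sketch of the dual-encoder design: one path for understanding
    (continuous SigLIP features), one for generation (discrete VQ codes),
    both mapped into the LLM's embedding space."""
    def __init__(self, d_model=2048, vq_vocab=16384, txt_vocab=32000):
        super().__init__()
        self.text_embed = nn.Embedding(txt_vocab, d_model)
        # Understanding path: flattened SigLIP patch features -> LLM space.
        self.understanding_adaptor = nn.Linear(1024, d_model)
        # Generation path: discrete VQ token IDs -> LLM space.
        self.generation_adaptor = nn.Embedding(vq_vocab, d_model)
        # Stand-in for the pretrained decoder LLM.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, 16, batch_first=True), 2)

    def forward(self, text_ids, siglip_feats=None, vq_ids=None):
        parts = [self.text_embed(text_ids)]
        if siglip_feats is not None:   # understanding task
            parts.append(self.understanding_adaptor(siglip_feats))
        if vq_ids is not None:         # generation task
            parts.append(self.generation_adaptor(vq_ids))
        # Early fusion: one multimodal sequence through the decoder.
        return self.llm(torch.cat(parts, dim=1))
</code></pre>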
<p>Their loss function is simply cross-entropy, and for understanding tasks (either text or multimodal) they compute the loss only on the text sequence, while for visual generation tasks, they compute the loss only on the image sequence.</p>
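<p>A sketch of that loss masking, assuming the usual PyTorch convention of marking ignored positions with -100; the only Janus-specific part is which positions get masked for which task.</p><pre><code class="language-python">
import torch.nn.functional as F

def janus_style_loss(logits, targets, is_image_token, task):
    """Cross-entropy over text positions for understanding tasks,
    over image positions for generation tasks (a sketch, not their code)."""
    labels = targets.clone()
    if task == "understanding":
        labels[is_image_token] = -100    # ignore image positions
    else:  # "generation"
        labels[~is_image_token] = -100   # ignore text positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
</code></pre>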
<p>During inference, they use CFG in the standard way, with a scale of 5. They claim SOTA on generation when compared to some older models, e.g. DALL-E 2, SDXL, etc., when evaluated on the GenEval benchmark, which, while interesting, is not particularly compelling. I&#8217;d prefer to see an Elo ranking vs the most commonly used standard models, such as Flux. The model appears good at prompt following, but not particularly aesthetic; I suspect it would fare poorly against, say, Flux. Having said that, the model performs very well for a 1.3B model.</p><h2>Conclusions</h2><p>With Chameleon, Pixtral, and PaliGemma, it looks like training methodologies are starting to converge. I think that the architecture used by Pixtral (and the pretraining recipe used by Chameleon) will basically be the recipe that most SOTA VLMs are using, if they&#8217;re not already.</p><p>Something worth noting is how (relatively) little compute was used to train many of the open source VLMs. The biggest LLaVa model used 768 A100 hours. DeepSeek-VL used 61,440 A100 hours (512 A100s for 5 days). PaliGemma used ~12k A100-equivalent hours (18k TPU v5e hours).</p><p>Compare that to, say, MovieGen, which used 6144 H100s for an unknown amount of time, or Stable Diffusion, which used <a href="https://www.notion.so/Novel-sampling-techniques-1180a181e3568013b0e0e7fee7886638?pvs=21">150k A100 hours</a>. Presumably the large AI labs are using much more compute/data to train their models; Chameleon, for instance, with 10T tokens, is much more in line with what I would expect for a SOTA model.</p><p>The two big decisions that have to be made when training a VLM appear to be 1) how to combine image/text inputs and 2) whether or not to train the visual &amp; language features separately. CogVLM did very well with a 17B model while keeping the pretrained components frozen. DeepSeek-VL trained everything, after some careful freezing during pre-training. Chameleon trained everything from scratch. I think that using a pretrained encoder is obviously cheaper, with little to no degradation in quality, so that&#8217;s the route to go. And &#8220;early fusion&#8221;, where the image features are concatenated to the text embeddings, seems elegant while performing well. That is probably the route I would go (basically following PaliGemma).</p><p>In short, the standard recipe appears to be:</p><ol><li><p>Use a ViT for the image encoder, initialized from a large open source model (SigLIP, CLIP, etc.)</p></li><li><p>Use a pretrained LLM as the base model</p></li><li><p>Finetune the combined model</p></li></ol><p>I see no evidence that it makes a difference whether we add the image/text data as inputs to the model, or do some fancier combination deeper in the model, like CogVLM or Flamingo did. The more recent models, like Qwen or DeepSeek, do well just having the image features added to the sequence data.</p><p>Finally, another reminder that I will be giving a talk as part of the <a href="https://www.youtube.com/playlist?list=PL_lsbAsL_o2DnQyMUrVTEv8EtIX0hn8KT">PyTorch Expert Exchange</a> lecture series on how batching works on modern GPUs, based on the <a href="https://www.artfintel.com/p/how-does-batching-work-on-modern">article I wrote last year</a>. 
Please join me!</p>]]></content:encoded></item><item><title><![CDATA[Papers I've read this week]]></title><description><![CDATA[This is a grab bag of papers.]]></description><link>https://www.artfintel.com/p/papers-ive-read-this-week-713</link><guid isPermaLink="false">https://www.artfintel.com/p/papers-ive-read-this-week-713</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Wed, 10 Jul 2024 00:41:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3VEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f454aa3-1f97-479e-ac1b-30caf6e36dec_1558x1136.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a grab bag of papers. No theme, just what I found interesting. I&#8217;ve had a bunch of tabs open and finally (<em>finally</em>) got through them.</p><p>I hope to write more frequently going forward: the goal is once per month. My darling toddler has not been sleeping consistently, so my writing time has been exceptionally limited. Recently this has improved and, with luck, will stay that way.</p><h1>Training LLMs over Neurally Compressed Text</h1><p>[<a href="https://arxiv.org/abs/2404.03626">abstract</a>]</p><p>The authors train LLMs over compressed text. When training language models, the current paradigm doesn&#8217;t involve raw text, but instead trains the model over sequences of <em>tokens</em>, which are, basically, compressed text. The most common tokenizer is <a href="https://huggingface.co/learn/nlp-course/en/chapter6/5">BPE</a> (used by GPT-3, Llama/Mistral, etc.). The idea behind tokenization is that tokenizers transform the raw text into a much more efficient representation: BPE is typically 4x more efficient than raw bytes, so the LLM sees 4x the data for a given computational budget.</p><p>The natural question, then, is: why stop at 4x? Is there something better than BPE? There really hasn&#8217;t been&#8212;almost every LLM uses BPE for tokenization, although, as usual, there&#8217;s a lack of details about the latest foundation models. In the limit, a perfect compression algorithm should remove all predictable information from the sequence of bytes, so the output shouldn&#8217;t be predictable at all; but could a tokenizer that&#8217;s, say, 8x more efficient than raw text be 2x as good as BPE?</p><p>The authors use a variety of compression techniques to train LLMs on ever more compressed text, looking at:</p><ul><li><p>GZip</p></li><li><p>LLM-based compression (which achieved a 12x compression ratio!)</p></li><li><p>Arithmetic encoding with logits conditioned on the sequence of text seen so far, i.e. 
</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{p}(x_i | x_0, \\ldots, x_{i-1})&quot;,&quot;id&quot;:&quot;RJIFNMXDGY&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Arithmetic encoding with static logits, i.e. </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{p}(x_i | x_0, \\ldots, x_{i-1}) = \\hat{p}(x_i)&quot;,&quot;id&quot;:&quot;XVQACXNXHY&quot;}" data-component-name="LatexBlockToDOM"></div></li></ul><p>They also use a technique which they developed, that splits the text into equal sized windows that each contain 32 bits of compressed information.</p><p>They find that their best models underperform subword baselines, and all the compression schemes (including GZip, which I found surprising) are learnable by standard LLMs, but the performance is worse than standard sub-word tokenizers, like BPE. However, their method <em>does</em> outperform byte-level baselines.</p><p>To a certain extent, this is unsurprising; the goal behind compression is to remove any predictable patterns from the original sequence of bytes, so if we had a perfect compressor, the resulting output would be indistinguishable from random noise. What is surprising, though, is that BPE just happens to be the sweet spot for compression.</p><p>How <a href="https://dl.acm.org/doi/abs/10.1145/214762.214771">arithmetic coding</a> works is:</p><ol><li><p>A message is represented by an interval of real numbers between 0 and 1.</p></li><li><p>As the message grows, the interval needed to represent it becomes smaller, so the number of bits needed grows.</p></li><li><p>you take as inputs an alphabet, which assigns a cardinality to the characters (i.e. an ordering from 0 to n) and a model that assigns probabilities to the characters from the alphabet conditioned on the previous characters in the sequence, i.e.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p(x_t | x_1, \\ldots, x_{t-1})&quot;,&quot;id&quot;:&quot;GHFVUNXQPK&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>Finally, we get an interval of two floating point numbers that represent the number.</p></li></ol><p>The original <a href="https://dl.acm.org/doi/abs/10.1145/214762.214771">paper</a> has a great example describing exactly how this works. The key takeaway is that arithmetic coding presents a way to use a probability distribution to compress text, and the better that our model represents the underlying distribution over characters, the more efficient the message.</p><p>The authors train a 3M parameter decoder in a fairly standard way, and use encoder</p><p>They use equal information windows, where they encode text into a series of N-bit windows, resetting the AC encoder when it can no longer add bits without exceeding the target bit threshold. 
Windows represent variable amounts of text, but each should contain the same amount of information.</p>
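<p>To make this concrete, here&#8217;s a float-precision sketch of the interval narrowing from steps 1&#8211;4 (real implementations use integer arithmetic to avoid running out of precision); <code>model</code> stands in for whatever assigns \(\hat{p}(x_i \mid x_0, \ldots, x_{i-1})\), whether static logits or the 3M-parameter compressor model.</p><pre><code class="language-python">
import math

def ac_interval(message, model, alphabet):
    """Arithmetic coding sketch: narrow [low, high) by each symbol's
    slice of the model's conditional distribution."""
    low, high = 0.0, 1.0
    for i, sym in enumerate(message):
        probs = model(message[:i])  # dict: p(x_i | x_0..x_{i-1})
        width = high - low
        cum = 0.0
        for a in alphabet:
            if a == sym:
                high = low + width * (cum + probs[a])  # uses the old low
                low = low + width * cum
                break
            cum += probs[a]
    return low, high  # any number in [low, high) identifies the message

# "Static logits" model: every symbol equally likely.
uniform = lambda prefix: {a: 1 / 3 for a in "abc"}
lo, hi = ac_interval("abca", uniform, "abc")
bits = math.ceil(-math.log2(hi - lo))  # ~ -log2 p(message); 7 bits here
</code></pre><p>The equal-info windowing then amounts to resetting <code>(low, high)</code> to <code>(0, 1)</code> every time the emitted bitstream hits the N-bit budget, exactly as described above.</p>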
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once they have the compressed sequence of bits, they then create tokens by grouping every N bits into a token, creating a vocabulary of size 2^N. They try with N = 8 and N = 16. This seems suboptimal to me&#8212; there&#8217;s no semantic meaning to the tokens!</p><p>The paper has a fascinating table showing how each of these changes weakens the compression ratios:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aEmB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aEmB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png 424w, https://substackcdn.com/image/fetch/$s_!aEmB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png 848w, https://substackcdn.com/image/fetch/$s_!aEmB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png 1272w, https://substackcdn.com/image/fetch/$s_!aEmB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aEmB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aa2146-ac35-481b-9bf5-e1842f344923_1582x606.png" width="1456" height="558" 
<p>The authors make a point of using the same hyperparameters for every training run they do. I think this is a mistake; a proper benchmark would tune the hyperparameters for each setting.</p><p>Their results are interesting:</p><ul><li><p>The ArithmeticCoding and StaticAC settings are &#8220;essentially unlearnable&#8221;, failing to outperform their naive baseline, which assigns equal probability to all tokens (aside: I love this baseline. More papers should include dumb heuristic baselines. 
We had a &#8220;UniformRandom&#8221; baseline agent in all our game theory papers and it performed remarkably well.)</p></li><li><p>EqualInfoAC performs the best, coming close to matching SentencePiece.</p></li></ul><p>They have some really interesting ablations, which show that the SentencePiece tokens are much more semantically relevant than EqualInfoAC&#8217;s.</p>
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The other ablations are fascinating. This was an excellent paper with strong empirical work. I would encourage you to read it.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p><h2>Mixture of Depths</h2><p>[<a href="https://arxiv.org/abs/2404.02258">abstract</a>]</p><p>A windmill that is constantly being tilted at in decoder-centric LLM research is the fact that each token receives the same amount of compute. This seems clearly wrong. This paper proposed a novel method, Mixture of Depths (MoD), that dynamically allocates FLOPs to specific positions in a sequence.</p><p>The obvious comparison is to the Mixture of Experts (MoE) models. The MoD method can be thought of as using the routing logic from MoE models, but deploying a single expert with dynamic skipping based on the routing logic.</p><p>At a high level, the algorithm is:</p><ol><li><p>Determine a compute budget which limits the number of tokens in a sequence that participate in a given block (say: 50% of the sequence participates in self-attention).</p></li><li><p>Use a per-block router to emit scalar weights for each token.</p></li><li><p>Select the top-k weights per block and sequence to participate in the block&#8217;s computation.</p></li></ol><p>Note how similar this is to a MoE model.</p><p>They use expert choice routing, as it removes the need for the oft-complicated auxiliary losses which are required for token-choice routing. The big problem, of course, is that the top-k operation is non-causal, which is why expert-choice routing isn&#8217;t used in any (published) MoE papers. 
They use a causal predictor-based approach, which has minimal degradation.</p>
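<p>Here&#8217;s a minimal sketch of the routing for a single MoD block, with the non-causal training-time top-k left as-is; the names are mine, and the router weighting is one plausible reading of the paper, not their code.</p><pre><code class="language-python">
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    """Sketch of Mixture-of-Depths routing: the top-k scoring tokens go
    through `block` (in the paper, a full attention+MLP block); everyone
    else skips it via the residual path."""
    def __init__(self, d_model, block, capacity=0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block, self.capacity = block, capacity

    def forward(self, x):                          # x: [batch, seq, d]
        scores = self.router(x).squeeze(-1)        # one scalar per token
        k = max(1, int(self.capacity * x.size(1)))
        top = scores.topk(k, dim=1).indices        # NB: non-causal at train time
        idx = top.unsqueeze(-1).expand(-1, -1, x.size(-1))
        sel = x.gather(1, idx)
        w = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        upd = torch.zeros_like(x)
        upd.scatter_(1, idx, w * self.block(sel))  # weight gives the router a gradient
        return x + upd

mlp = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
out = MoDRouter(256, mlp)(torch.randn(2, 128, 256))
</code></pre>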
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Their results are nice; they find that MoD is a nice improvement, lowering loss compared to the isoFLOP vanilla model. Additionally, they find that the MoD improvements compound with the improvements from training a MoE model:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!12zU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!12zU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png 424w, https://substackcdn.com/image/fetch/$s_!12zU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png 848w, https://substackcdn.com/image/fetch/$s_!12zU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!12zU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!12zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c57c9c0-7436-4b38-96ab-843b5a131cfe_1606x1152.png" width="1456" height="1044" 
<p>The implications of this paper are fascinating; one can imagine a family of ever more complex routing mechanisms which let every decision become learned.</p><h2>Sparse Upcycling: Training MoE models from dense checkpoints</h2><p>[<a href="https://arxiv.org/abs/2212.05055">abstract</a>]</p><p>The paper proposes a method to initialize a MoE model from a dense one, showing that this outperforms the original, dense model while using only 50% of the original compute, and also outperforms MoE models trained from scratch with the same compute budget.</p><p>The paper makes the 
fascinating observation that the vast majority of SOTA neural networks are trained <em>from scratch</em>, which, in many ways, is assumed to be the default way that all models are trained. Given the <a href="https://arxiv.org/abs/1803.03635">lottery ticket hypothesis</a>, it&#8217;s not at all clear that this is optimal. Even though the weights are chosen <a href="https://xkcd.com/221/">randomly</a>, this doesn&#8217;t mean that they&#8217;re good; the joke RL researchers like to make, that the random seed is a crucial hyperparameter, points at a valid tactic when deploying systems into production. If OpenAI could produce a better ChatGPT by using seed 3 rather than seed 5, they&#8217;d absolutely do that.</p><p>In any case, the paper explores developing cheaper ways of training large models by using existing models. This is much more relevant now than when the paper was released (ICLR 2023, so the work was probably done in the second half of 2022): we&#8217;re training much larger models with much more compute, and doing so much more often.</p><p>Generally speaking, <a href="https://www.artfintel.com/p/papers-ive-read-this-week-mixture">Mixture of Experts models work</a> by having N copies of each layer (we call each copy an expert) and learning to allocate tokens to each expert. The upcycling algorithm that they propose creates a MoE model by expanding a subset of the MLP layers in the original, dense model into MoE layers, and copying the remaining layers of the dense model across to the MoE. Each expert is then initialized identically, as a copy of the original MLP, and the routers are randomly initialized. They continue training the model using the same hyperparameters (batch size, learning rate, etc.). They even find that <em>resuming the optimizer state</em> from the original dense checkpoint works well, which I find surprising.</p>
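<p>A sketch of the upcycling initialization, assuming a dense model whose blocks expose an <code>mlp</code> submodule (the attribute name and the <code>every</code> schedule are my inventions, not the paper&#8217;s):</p><pre><code class="language-python">
import copy
import torch.nn as nn

class MoELayer(nn.Module):
    """Initialization only: every expert starts as an identical copy of
    the dense MLP, and only the router gets fresh random weights."""
    def __init__(self, dense_mlp, d_model, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # randomly initialized

def upcycle(dense_blocks, d_model, n_experts=8, every=2):
    """Expand every `every`-th block's MLP into an MoE layer; all other
    weights are copied across unchanged."""
    for i, block in enumerate(dense_blocks):
        if i % every == 0:
            block.mlp = MoELayer(block.mlp, d_model, n_experts)
    return dense_blocks
</code></pre>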
<p>The results are, bluntly, quite good. The slope of improvement in the validation metrics seems to change immediately; it seems as if upcycling the model unlocks new avenues of performance.</p><p>Given the earlier results from the Mixture of Depths paper, which suggested that MoD models compose with MoE models, this suggests a natural experiment: train a MoD model to some level of performance, and then upcycle it to a MoE/MoD model. 
As an engineer, I shudder at the resulting complexity, but it should deliver a significant jump in performance.</p><h2>Replication of Chinchilla</h2><p><a href="https://twitter.com/borgeaud_s/status/1780988694163321250">https://twitter.com/borgeaud_s/status/1780988694163321250</a></p><p>[<a href="https://arxiv.org/abs/2404.10102">abstract</a>]</p><p>Finally, this was quite short, but an exciting development that gave me faith in research. A team at <a href="https://epochai.org/">Epoch AI</a>, an independent research institute that does excellent work, tried <em>reproducing Chinchilla by transcribing the data from the pixels of the chart.</em> Let me repeat that: they reproduced Chinchilla by transcribing the data from the pixels of the original paper. And! What&#8217;s more! They found an error in the paper that caused the original authors to issue a correction and promise to <a href="https://x.com/borgeaud_s/status/1780988694163321250">release the data</a>. Truly excellent work on their part.</p><p>The root of the paper comes from the observation that the error bars from Approach 3 of the Chinchilla paper are extremely narrow ([0.454, 0.455] and [0.542, 0.543] for parameters <em>a</em> and <em>b</em>).</p>
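<p>For reference, Approach 3 fits the parametric loss below, and <em>a</em> and <em>b</em> are the compute-scaling exponents derived from the fitted exponents \(\alpha\) and \(\beta\) (this is the Chinchilla paper&#8217;s setup as I remember it, so double-check the exact form against the paper):</p><p>\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad N_{\mathrm{opt}} \propto C^{a}, \quad D_{\mathrm{opt}} \propto C^{b}, \qquad a = \frac{\beta}{\alpha+\beta}, \quad b = \frac{\alpha}{\alpha+\beta}, \] where \(C \approx 6ND\) is the training compute in FLOPs.</p>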
srcset="https://substackcdn.com/image/fetch/$s_!kDX4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76a3e0a-b261-4676-abe3-5402f3479745_4096x2082.png 424w, https://substackcdn.com/image/fetch/$s_!kDX4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76a3e0a-b261-4676-abe3-5402f3479745_4096x2082.png 848w, https://substackcdn.com/image/fetch/$s_!kDX4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76a3e0a-b261-4676-abe3-5402f3479745_4096x2082.png 1272w, https://substackcdn.com/image/fetch/$s_!kDX4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76a3e0a-b261-4676-abe3-5402f3479745_4096x2082.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Given they only had approximately 400 observations, that&#8217;s implausibly tight. 
This led to Epoch recreating the data and fitting their own model, which found that the original Chinchilla paper had an error in its estimation code, causing the fitted equation to not match the data particularly well.</p><p>The revised results imply that you can lean much more on the compute side of the compute/data tradeoff.</p>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If we revisit the excellent <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications">&#8220;chinchilla&#8217;s wild implications&#8221;</a> article by <a href="https://x.com/nostalgebraist?lang=en">Nostalgebraist</a> and plug in the numbers for GPT-3, Chinchilla, and Gopher, we get that:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i2qU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i2qU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 424w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 848w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 1272w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i2qU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png" width="989" height="590" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:989,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i2qU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 424w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 848w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 1272w, https://substackcdn.com/image/fetch/$s_!i2qU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96991f99-6a4b-4059-ad33-0315d2e50832_989x590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Across the board, the contribution of data to the loss increases, while the contribution of the model size decreases. So the conclusion from the updated data is that the number of tokens is less important than originally thought, but GPT-3 era models were still trained with not nearly enough data. 
<p>In practice, it's not clear how much this matters, because everyone is trying to get as much data as possible and train the biggest model they can afford. But it does show that, on the margin, spending more money on GPUs and less on data labellers is worth doing (did Jensen sponsor this work? &#129300;).</p><p>Misc AI articles I'm reading:</p><ul><li><p>An <a href="https://x.com/EpochAIResearch/status/1774860862148284525">excellent article</a> by <a href="https://x.com/EgeErdil2">Ege Erdil</a> of Epoch about why we should expect inference &amp; training budgets to be approximately equal</p></li><li><p>Horace He's excellent article on <a href="https://www.thonking.ai/p/what-shapes-do-matrix-multiplications">why matrix shapes matter so much for performance</a>.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[How does batching work on modern GPUs?]]></title><description><![CDATA[The first and most important optimization you can do for any modern deep learning system, generally speaking, is to implement batching.]]></description><link>https://www.artfintel.com/p/how-does-batching-work-on-modern</link><guid isPermaLink="false">https://www.artfintel.com/p/how-does-batching-work-on-modern</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Fri, 01 Mar 2024 16:35:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KnyU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b8861c-d027-4429-afeb-5caa17b194b7_567x432.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first and most important optimization you can do for any modern deep learning system, generally speaking, is to implement batching. When you batch inference, instead of sending a single input, you send a batch of N inputs. Most of the time, depending on the specific value of N, this is free: running inference takes the same amount of time with a single example as it does with N examples. Why is this? At first, it seems like processing the batch shouldn't be free; after all, Nx more work is being done.</p>
<p>And with a naive model of how neural networks work, it isn't free. The batched calculation requires Nx the compute, and, in fact, if you run it on a CPU, you'll see that this is true (average inference time for ResNet50, <a href="https://colab.research.google.com/drive/1BoiNq1a8NSy9zUsFlrStCBZ_qSDKeLev?usp=sharing">Colab</a>):</p><p>[Figure: average inference time for ResNet50 vs. batch size, on CPU]</p>
<p>However, when you run the same example on a modern GPU, this isn't the case.
This is what we see (on a T4):</p><p>[Figure: average inference time for ResNet50 vs. batch size, on a T4 GPU]</p>
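<p>For readers who want to reproduce curves like these, here's a minimal sketch of the kind of timing loop behind the plots. It isn't the notebook linked above, just the general shape of the measurement, and the warmup/repeat counts are arbitrary.</p>
<pre><code>import time
import torch
import torchvision

# A minimal sketch of a batch-size timing benchmark (not the linked notebook).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().eval().to(device)

for batch_size in [1, 2, 4, 8, 16, 32, 64]:
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(3):            # warmup, so we don't time CUDA init etc.
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # GPU kernels are async; wait before timing
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10
    print(f"batch={batch_size:3d}  {elapsed * 1e3:7.2f} ms/forward")
</code></pre>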
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Going from a batch size of 1, to 2, to 3 requires no additional time, and then after that it increases linearly.</p><p>Why is this? Concurrency. Modern GPUs run their operations concurrently (and are, actually, slower than CPUs on a per-thread basis).</p><p>When we think of &#8220;calculating inference for an example with a model&#8221;, we typically think of the model as a single block, when, of course, it&#8217;s made up of many matrices. When we run inference, each matrix is loaded into memory. Specifically, each <em>block</em> of the matrix is loaded into on-device memory, namely the shared memory unit (only 192kb big on an A100). The block is then used to compute the results for every element in the batch. Note that <em>this is not the same as GPU RAM</em>, i.e. HBM. An A100 has 40GB or 80GB of HBM depending on the model, but only 192kb of on-device memory. This creates a memory bandwidth bottleneck when performing mathematical operations, as we are constantly moving data in and out of the on-device memory. We can approximate the time it takes to transfer the weights by calculating the model size / memory bandwidth ratio, and approximate the time it takes to do the calculation by model FLOPS / GPU FLOPS.</p><p>With an MLP, the FLOPS are approximately <a href="https://www.stat.cmu.edu/~ryantibs/convexopt-F18/scribes/Lecture_19.pdf">2 * the number of parameters * the number of elements in the batch</a> (2 * m * n * b for a batch size of <em>b</em> and a <em>m x n</em> matrix). 
As a result, the transfer time is equal to the calculation time when</p><p>$$\frac{2 \cdot P}{\text{memory bandwidth}} = \frac{2 \cdot B \cdot P}{\text{FLOPS}}$$</p><p>(the 2 on the left is bytes per parameter, assuming fp16/bf16 weights; the 2 on the right comes from one multiply and one add per weight). Note that we can cancel out the number of parameters here:</p><p>$$\frac{2}{\text{memory bandwidth}} = \frac{2 \cdot B}{\text{FLOPS}}$$</p><p>And rearrange in terms of the batch size:</p><p>$$B = \frac{\text{GPU FLOPS}}{\text{memory bandwidth}}$$</p><p>When the batch size is less than the ratio of FLOPS to memory bandwidth, we are bottlenecked by memory bandwidth. When it is <em>more</em>, we are bottlenecked by FLOPS. Note that this analysis is for an MLP, not for a convolutional network like ResNet50; that gets a bit trickier.</p><p>On a T4 GPU (<a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf">datasheet</a>), we have 65 TFLOPS of fp16 (mixed-precision) compute and 300 GB/s of memory bandwidth, so the magic ratio should be about 216. When we run an MLP (depth: 8, width: 1024), we see roughly what we'd expect:</p><p>[Figure: MLP inference time vs. batch size on a T4]</p>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87cfa006-cf96-44f4-b5a2-2c02833238a7_593x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:593,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25503,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!77tO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cfa006-cf96-44f4-b5a2-2c02833238a7_593x432.png 424w, https://substackcdn.com/image/fetch/$s_!77tO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cfa006-cf96-44f4-b5a2-2c02833238a7_593x432.png 848w, https://substackcdn.com/image/fetch/$s_!77tO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cfa006-cf96-44f4-b5a2-2c02833238a7_593x432.png 1272w, https://substackcdn.com/image/fetch/$s_!77tO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87cfa006-cf96-44f4-b5a2-2c02833238a7_593x432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There&#8217;s some noise, but it&#8217;s basically what we&#8217;d expect: inference time starts increasing dramatically around the ~128 mark (here, we double the batch size, so we see batches at 128, 256, and then 512). 
<p>And, if we vary the width of the MLP layers, we see that this holds across a broad variety of architectures (the following is a log-log plot, to fit everything in):</p><p>[Figure: log-log plot of MLP inference time vs. batch size, for several layer widths]</p>
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is pretty cool! We can see the critical threshold across a broad variety of different architectures. What&#8217;s also interesting is that the smaller networks don&#8217;t really see any scaling, taking roughly constant time across the entire range of batch sizes (from 1 to 512). My hand-wavy explanation for this is that this is because GPUs are really, really fast when it comes to actually doing math, but everything else (CPUs, etc.) is kinda slow. We see a <em>ton</em> of noise at the start, which I don&#8217;t have a great explanation for (other than shrugging and saying &#8220;overhead&#8221;).</p><p>For many ML engineers, their time isn&#8217;t spent doing much machine learning, but rather it&#8217;s spent just getting rid of overhead, which is typically in the non-ML code. In reinforcement learning (RL) research, particularly for researchers who work on continual learning problems, where there&#8217;s a single agent taking a long stream of actions, it&#8217;s often not worth it to use a GPU for experiments unless either 1) you have a very large network or 2) you extensively optimize every other aspect of your stack (if you want to make an old DeepMind engineer squirm, ask them about in-graph environments&#8212; at one point, we were implementing RL environments within the tensorflow graph).</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Artificial Fintelligence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>What about convolutional networks?</h2><p>In a convolutional network, the weights are equal to the number of filters times the filter size. For <code>torch.nn.Conv2d</code> , this is <code>kernel_size^2 * out_channels</code>. 
<h2>What about transformers?</h2><p>Transformers are basically just MLPs, so we can treat them the same way. They have an attention mechanism, obviously, but with a KV cache (which keeps the computed keys and values around in memory), the time taken by attention is minimal. I wrote about <a href="https://www.artfintel.com/p/where-do-llms-spend-their-flops">this a lot previously</a>.</p><p>The same is true for a Mixture of Experts (MoE) model. In many transformer implementations, the KV cache lives inside the attention class (e.g. <a href="https://github.com/google/maxtext">MaxText</a> is a <a href="https://github.com/google/maxtext/blob/main/MaxText/layers/attentions.py#L91">great example</a>). As the only difference between a MoE model and a vanilla decoder is that some of the feedforward layers are replaced with MoE layers, the KV cache will behave the same, as will inference, with one wrinkle.</p><p>The wrinkle is that the gating mechanism in a MoE layer splits the batch across experts, so if the gate doesn't split the batch uniformly, this can cause problems. There are routing mechanisms which avoid this (e.g. expert choice), but in autoregressive decoders you're pretty much forced to use token choice routing, which has a tendency towards biased gates. Forcing the gate to allocate tokens evenly is 1) an area of active research and 2) an important objective that is optimized during training.</p><p>I hope this has been helpful. Please reach out if you have any batching questions (or, more importantly, any corrections).</p>]]></content:encoded></item><item><title><![CDATA[Where do LLMs spend their FLOPS?]]></title><description><![CDATA[LLM theory, with a hint of empirical work]]></description><link>https://www.artfintel.com/p/where-do-llms-spend-their-flops</link><guid isPermaLink="false">https://www.artfintel.com/p/where-do-llms-spend-their-flops</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Mon, 29 Jan 2024 16:28:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ff8e22-07a1-4726-b422-9e39c304ee02_1476x438.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ok folks.
I had a longer holiday break than expected thanks to some family illnesses. I'm still <s>lowering</s> adjusting my expectations for how often I can expect to be healthy, now that I have a toddler. Sickness. Lots, and lots of sickness.</p><p>This is a <em>long</em> article, so you might want to bookmark it and read it on your computer, not your phone.</p><p>Here, I conduct a theoretical analysis of LLM performance, and then profile an <em>actual</em> LLM to see where the empirical results differ from the theory. First, the theory. I will rely on the excellent <a href="https://kipp.ly/transformer-inference-arithmetic/#flops-counting">blog post</a> by <a href="https://twitter.com/kipperrii">Kipply</a> to fill in the details. The basic takeaway is that for a standard decoder model, our FLOPS are allocated as follows (on a per-layer basis):</p><ol><li><p>6d^2 to compute QKV</p></li><li><p>2d^2 to compute the attention output matrix, <code>softmax(Q @ K.T) @ V</code></p></li><li><p>16d^2 to run the FFN</p></li></ol><p>The sum is 24d^2 FLOPS. Percentage-wise, we spend 25% of the time computing QKV, ~8% computing the attention output matrix, and ~66% of the time running the FFN.</p>
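<p>Here's a minimal sketch of that accounting in code, just to make the shares explicit; it mirrors the per-layer counts listed above.</p>
<pre><code># Per-layer, per-token FLOPs for a standard decoder block, following the
# accounting in this post (d = model dimension).

def decoder_layer_flop_shares(d: int) -> dict:
    flops = {
        "qkv_projections": 6 * d**2,
        "attention_output": 2 * d**2,
        "ffn": 16 * d**2,
    }
    total = sum(flops.values())  # 24 * d**2
    return {name: f / total for name, f in flops.items()}

print(decoder_layer_flop_shares(4096))
# {'qkv_projections': 0.25, 'attention_output': 0.083..., 'ffn': 0.666...}
</code></pre>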
<p>What about the attention mechanism? Well, as everyone knows, the attention equation is</p><p>$$\text{softmax}\left(\frac{QK^T}{\sqrt{d_\text{head}}}\right)V$$</p><p>With a KV cache (you <em>are</em> using a KV cache, right anon?), Q, K, and V are all d-dimensional vectors (equivalently, <code>(d, 1)</code> matrices) for the new token. So it takes <a href="https://www.stat.cmu.edu/~ryantibs/convexopt-F18/scribes/Lecture_19.pdf">~2d flops for each dot-product</a> and d flops for the d divisions that happen, for a total of ~5d flops, which rounds to nothing:</p><p>$$\frac{5d}{24d^2} = \frac{5}{24d}$$</p><p>For d equal to 4096 (the value it takes in Llama7b), this is 0.005%, so nothing. This makes it seem like the attention mechanism doesn't matter, but of course, we only use a KV cache (and flash attention, etc.) because it matters so much. Think of Milton Friedman's <a href="https://worthwhile.typepad.com/worthwhile_canadian_initi/2010/12/milton-friedmans-thermostat.html">thermostat analogy</a> (h/t <a href="https://twitter.com/bradchattergoon/status/1744008010924483069">@bradchattergoon</a>):</p><blockquote><p>If a house has a good thermostat, we should observe a strong negative correlation between the amount of oil burned in the furnace (M), and the outside temperature (V). But we should observe no correlation between the amount of oil burned in the furnace (M) and the inside temperature (P). And we should observe no correlation between the outside temperature (V) and the inside temperature (P).</p><p>An econometrician, observing the data, concludes that the amount of oil burned had no effect on the inside temperature. Neither did the outside temperature. The only effect of burning oil seemed to be that it reduced the outside temperature.</p><p>A second econometrician, observing the same data, concludes that causality runs in the opposite direction. The only effect of an increase in outside temperature is to reduce the amount of oil burned. An increase in V will cause a decline in M, and have no effect on P.</p><p>But both agree that M and V are irrelevant for P. They switch off the furnace, and stop wasting their money on oil.</p></blockquote><p>The KV cache does require O(T) memory (where T is the number of tokens we wish to generate), which ain't cheap (see: $NVDA).</p><p>How big is the KV cache? Well, for each token, we store the following number of bytes (the first 2 is because we assume bf16 precision, so 2 bytes per parameter, and the second 2 is because we have to store both the K and the V tensors):</p><p>$$2 \cdot 2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d_{\text{head}}$$</p><p>Note that, by assumption, n_heads * d_head = d_model = d, so the number of bytes is 4 * the number of layers * d.</p><p>For GPT-3, we have 96 layers with a <code>d_model</code> of 12288, so we need 4.72 MB per token. It would thus require roughly 9.7GB (4.72 MB * 2048) to generate 2048 tokens.</p><p>Having said this, to generate a sequence of a given length with a given model, even without a cache we still need the same amount of memory the KV cache would require; we just throw it away at the end of each forward pass. So we don't need <em>more</em> memory. In a sense, the KV cache is free (modulo some tedious bookkeeping, at least in Jax).</p><p>How does this change for more modern architectures, like Mistral 7B? Mistral 7B uses <a href="https://paperswithcode.com/method/grouped-query-attention">grouped query attention</a> (as does Llama2; almost as if there's an overlap in authors or something...) and sliding window attention.</p><p>In GQA, you share the KV projection across multiple heads: either a single KV projection across all heads (<a href="https://arxiv.org/abs/1911.02150">MQA</a>) or one per group of heads (<a href="https://arxiv.org/abs/2305.13245v3">GQA</a>). These are all equivalent to standard multi-head attention (MHA) with a smaller <code>d_model</code>. Earlier, we did the KV cache calculations assuming that the number of heads * the head dimension equals the model dimension, but for MQA/GQA we relax that assumption. The KV cache formula</p><p>$$2 \cdot 2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d_{\text{head}}$$</p><p>becomes</p><p>$$4 \cdot n_{\text{layers}} \cdot d_{\text{effective model}}$$</p><p>where the number of KV heads * the head dimension is the effective model dimension. Thus, we see a linear decrease in the KV cache size as the number of KV heads decreases (one of the key motivating factors behind GQA/MQA).</p>
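<p>Here's that formula as a minimal code sketch, with the GPT-3 numbers from above and a hypothetical GQA configuration to show the linear saving; the bf16 assumption (2 bytes) is carried over from the text.</p>
<pre><code># Per-token KV cache size, following the formula above:
# 2 (K and V) * 2 (bytes, bf16) * n_layers * n_kv_heads * d_head.

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int) -> int:
    return 2 * 2 * n_layers * n_kv_heads * d_head

# GPT-3: 96 layers, d_model = 12288 (96 heads of dimension 128), full MHA.
gpt3 = kv_cache_bytes_per_token(n_layers=96, n_kv_heads=96, d_head=128)
print(gpt3 / 1e6, "MB/token")                    # ~4.72 MB/token
print(gpt3 * 2048 / 1e9, "GB for 2048 tokens")   # ~9.7 GB

# Hypothetical GQA variant with 8 KV heads instead of 96: a 12x smaller cache.
gqa = kv_cache_bytes_per_token(n_layers=96, n_kv_heads=8, d_head=128)
print(gqa / 1e6, "MB/token")
</code></pre>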
<p>The parameters of the Llama{1,2} models are given by:</p><p>[Figure: table of Llama 1/2 model configurations]</p>
<p>So for Llama 2, the KV cache required per token is:</p><p>[Figure: per-token KV cache size for the Llama 2 models]</p>
srcset="https://substackcdn.com/image/fetch/$s_!qsde!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ff8e22-07a1-4726-b422-9e39c304ee02_1476x438.png 424w, https://substackcdn.com/image/fetch/$s_!qsde!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ff8e22-07a1-4726-b422-9e39c304ee02_1476x438.png 848w, https://substackcdn.com/image/fetch/$s_!qsde!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ff8e22-07a1-4726-b422-9e39c304ee02_1476x438.png 1272w, https://substackcdn.com/image/fetch/$s_!qsde!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ff8e22-07a1-4726-b422-9e39c304ee02_1476x438.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Without GQA, the 34B model would take 5x as much memory for the KV cache, and the 70B model would take 8x more memory.</p><p>Sliding window attention, another one of the Llama/Mistral tweaks, guarantees that we can cap the KV cache at the window size, which is 4096 for Llama7B.</p><h1>Performance motivated architectural changes</h1><p>As discussed, above, a LLM uses 24d^2 FLOPS per layer. Increasing the number of layers linearly scales the number of flops, and the number of parameters. Increasing the model <em>width</em> quadratically scales the model size. Note that this is because the number of parameters scales quadratically with d_model, as most of our layers go from a d_model input vector to a d_model output vector, so we have weight matrices that are <code>(d_model, d_model)</code> in size. Another way of putting this is that compute scales linearly with the number of parameters, and increasing <code>d_model</code> increases the number of parameters quadratically. Making the model 2x deeper doubles the parameters, but making it 2x wider <em>quadruples</em> the parameters.</p><p>Having said this, one advantage of a wider model is that it parallelizes better. 
<p>Having said this, one advantage of a wider model is that it parallelizes better. To compute the Nth layer, you must first compute the preceding N-1 layers. This is <a href="https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8">difficult to parallelize</a> efficiently, particularly during training, while it is much easier to split a single layer across GPUs via <a href="https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism">tensor parallelism</a>. If you care mostly about latency, you probably want to bias yourself towards a wider model.</p><h1>Empirical analysis</h1><p>I did this analysis using Colab (<a href="https://colab.research.google.com/drive/1TH6AKsICZqlFoW1ph8h3wsF7q7qVMF8T?usp=sharing">notebook</a>). Here's the high-level profile for a single forward pass (<a href="https://finbarr.ca/static/llama-profile.svg">interactive profile on my website</a>):</p><p>[Figure: flame-graph profile of a single Llama forward pass]</p>
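<p>For anyone who wants to produce a similar trace, here's a minimal sketch using torch.profiler; it isn't the linked notebook, and the model and input here are stand-ins rather than the Llama checkpoint profiled below.</p>
<pre><code>import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and input; the post profiles a Llama checkpoint instead.
# Assumes a CUDA device is available.
model = torch.nn.TransformerEncoderLayer(d_model=4096, nhead=32).eval().cuda()
x = torch.randn(1, 8, 4096, device="cuda")

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        model(x)

# Prints a table of ops sorted by GPU time; export_chrome_trace produces the
# flame-graph-style view shown above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
prof.export_chrome_trace("forward_trace.json")
</code></pre>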
<p>We see that 4.88% of the overall time from this run was spent within this single forward pass. Of the forward pass, 1.98% is spent in attention, while 2.58% is spent in the MLP. As a fraction of the forward pass itself, that&#8217;s about 40% in the attention layer and 53% in the MLP. Within the attention layer, the time is spent on 4 different linear layers: two of them take approximately equal time (linear_1, linear_2), one takes 50% more (linear_3), and one takes twice as long as the first two (linear_0). My guess is that linear_0 is calculating the query embedding, while linear_1/2 are calculating the key and value embeddings. Note how much quicker the key/value calculation is because of the smaller number of KV heads! GQA makes a tangible difference, even though the attention mechanism being used (<a href="https://facebookresearch.github.io/xformers/components/ops.html">xformers.ops.memory_efficient_attention</a>) requires that the QKV embeddings be broadcast to the same size.</p><p>If we go back to our theoretical analysis, it predicted that 2/3rds of the time would be spent calculating the FFN, and 1/3rd calculating attention. That&#8217;s roughly in line with what we see, although attention takes a somewhat larger share than the theory predicts; I suspect that&#8217;s because the MLP is executing a very well-optimized path for Torch.</p>
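<p>If you want to reproduce this kind of profile yourself, <code>torch.profiler</code> is enough. A minimal sketch (the toy model and input here are placeholders, not the notebook&#8217;s actual code):</p><pre><code>import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input standing in for the Llama model and a prompt batch.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(1, 512, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Text table of the most expensive ops; the chrome trace gives the
# interactive flame-graph view linked above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
</code></pre>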
<h2>Performance changes</h2><p>I then ran a number of experiments with Llama2 where I swept over the model width and depth. These are the results:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!J1sx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5900857-9180-41c6-a76e-14d9d1f7550b_567x433.png" alt=""></figure></div>
<p>This is really interesting. We see basically no change in speed between the models with hidden sizes of 1024 and 1536 (1.10s vs 1.11s), and only a minor one (1.15s vs 1.10s) between 1024 and 2048. However, when we compare the models with hidden dimensions of 2048 (1.15s), 3072 (1.41s), and 4096 (1.82s), we start to see what looks like linear scaling!</p><p>My explanation is that there&#8217;s non-trivial overhead from dispatching the kernels and actually running the matmuls. This was run on a T4 (<a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf">specs</a>), which, although dated by modern standards, still has 65 TFLOPS of fp16 compute. Multiplying two 1024x1024 matrices requires about 2 * 1024^3 = 2.1 GFLOPs of compute, so we can (theoretically) multiply roughly 30,000 1024x1024 matrices together per second. In practice, we&#8217;d only get 60-80% of that, but that&#8217;s still roughly 20,000 matmuls per second. A lot of this advantage comes from the massive number of cores that modern GPUs have. A T4 has 2560 CUDA cores, each running at between <a href="https://www.techpowerup.com/gpu-specs/tesla-t4.c3316">585 and 1590 MHz</a>. As a result, any task that can be parallelized will do well, but those that require sequential calculation will not be as optimized. I think that&#8217;s what we&#8217;re seeing here: there&#8217;s not enough parallelism to actually occupy the GPU.</p>
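<p>The back-of-the-envelope version of that argument, as a sketch:</p><pre><code># Matmul throughput on a T4, using the numbers from the text.
t4_flops = 65e12             # fp16 tensor-core FLOPs/s
n = 1024
flops_per_matmul = 2 * n**3  # each output element is a multiply-add over n terms
theoretical = t4_flops / flops_per_matmul
print(f"{theoretical:,.0f} matmuls/s theoretical")        # ~30,000
print(f"{0.7 * theoretical:,.0f} matmuls/s at 70% util")  # ~21,000
</code></pre>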
<p>The transformer depth causes performance to behave exactly as you&#8217;d expect: inference time increases linearly with depth. There&#8217;s some noise when it comes to the deepest models, but it&#8217;s pretty well-behaved.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EfZd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8285187-ca90-49cc-a304-0fec540b30b9_567x432.png" alt=""></figure></div>
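<p>The sweep methodology is straightforward to sketch (illustrative configs and timing loop, not the notebook&#8217;s exact code):</p><pre><code>import time
import torch

# Illustrative depth sweep: time forward passes for models that differ
# only in the number of layers.
def make_model(d_model: int, n_layers: int) -> torch.nn.Module:
    layer = torch.nn.TransformerEncoderLayer(
        d_model=d_model, nhead=8, dim_feedforward=4 * d_model, batch_first=True
    )
    return torch.nn.TransformerEncoder(layer, num_layers=n_layers).cuda().eval()

x = torch.randn(1, 256, 1024, device="cuda")
for n_layers in (2, 4, 8, 16):
    model = make_model(1024, n_layers)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(10):  # 10 runs to reduce noise
            model(x)
    torch.cuda.synchronize()
    print(n_layers, (time.perf_counter() - start) / 10)
</code></pre>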
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I then calculated the cost as we generate more tokens (I did 10 runs for each number of tokens, to reduce the noise):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mM5S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mM5S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 424w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 848w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 1272w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png" width="567" height="432" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:567,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19100,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mM5S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 424w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 848w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 1272w, https://substackcdn.com/image/fetch/$s_!mM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa54955b3-97c2-4b99-a1b3-913d4a66de9f_567x432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s exactly linear, as you&#8217;d expect, because Llama2 uses a KV cache. 
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9Bwu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe1d3cd8-ead6-4687-939f-b4f5c7911f2f_571x432.png" alt=""></figure></div><p>We see that the model has a jump of ~2.1MB every 20 tokens. As this model has a <code>d_model</code> of 1024 and 8 hidden layers, the KV cache needs 2 (for K and V) * 2 bytes * num_layers * d_model = 4 * 8 * 1024 bytes = 32KB of memory per token, so 20 tokens <em>should</em> only need 640KB of memory. It&#8217;s unclear where the extra ~3x overhead comes from; I suspect the answer is an inefficient implementation.</p>
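<p>The arithmetic, as a quick sketch:</p><pre><code># Expected KV cache growth for this model (fp16 = 2 bytes per value).
d_model, n_layers = 1024, 8
bytes_per_token = 2 * 2 * n_layers * d_model  # K and V, 2 bytes each
print(bytes_per_token)                        # 32768 bytes = 32KB per token
print(20 * bytes_per_token / 1024)            # 640KB expected per 20 tokens
print(2.1e6 / (20 * bytes_per_token))         # ~3.2x observed overhead
</code></pre>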
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We see that the model has a jump of ~2.1MB every 20 tokens. As this model has <code>d_model</code> of 1024 and 8 hidden layers, it needs 4 * num_layers * d_model bytes of memory, or 4 * 8 * 1024 bytes = 32KB of memory per token. We <em>should</em> only need 640KB of memory. It&#8217;s unclear where the extra 3x overhead comes from. I suspect the answer is an inefficient implementation. </p>]]></content:encoded></item><item><title><![CDATA[The evolution of the LLM API market]]></title><description><![CDATA[Before I studied machine learning, I was an Econ grad student banging out OLS problem sets (I see the OLS equation&#8212; (X&#8217;X)^-1X&#8217;y&#8212; whenever I close my eyes, I derived it so many times).]]></description><link>https://www.artfintel.com/p/the-evolution-of-the-llm-api-market-dcf</link><guid isPermaLink="false">https://www.artfintel.com/p/the-evolution-of-the-llm-api-market-dcf</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Wed, 13 Dec 2023 15:15:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_1KW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40caa01-9f04-47cb-ac76-392b804cf46b_1498x868.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before I studied machine learning, I was an Econ grad student banging out OLS problem sets (I see the OLS equation&#8212; <code>(X&#8217;X)^-1X&#8217;y</code>&#8212; whenever I close my eyes, I derived it so many times). My research area was antitrust theory, and in particular, vertical integration. That gives me a unique perspective: how will the LLM API market evolve as more companies enter the space?</p><p>The market began, famously, with OpenAI releasing ChatGPT and rapidly hitting <a href="https://www.theinformation.com/articles/openais-revenue-crossed-1-3-billion-annualized-rate-ceo-tells-staff">$1.3B in revenue</a>. At this time last year, however, there was basically no competition in the LLM API market. Bard was yet to be released, let alone Claude, and Gemini was a mere twinkle in Sundar&#8217;s eyes. 
OpenAI had a monopoly in the market, letting them capture basically all of the value.</p><p>In the year since, what we&#8217;ve seen is that there doesn&#8217;t appear to be a moat in LLMs except at the highest end. GPT-4 is the only model which doesn&#8217;t have competition, and there are competitors sniffing around&#8212; Gemini Ultra, Llama 3, and the as-yet-unreleased mysterious Mistral model bigger than medium. At the GPT 3.5 level, however, you have many options for hosting, and you can even host it yourself. This necessarily limits the prices any company can charge.</p><p>Generally speaking, companies enter a new market when they think they can make a profit above the minimum threshold they require. The larger the company is, the smaller the profit threshold they require. If I, an individual, were to start offering a service to finetune LLMs, I would need to charge a fairly high margin at first, as I would have a small customer base to spread the costs over. As my company grows, I would have a larger customer base to spread the costs over, and would have more money to spend on optimizations enabling me to serve LLMs more cheaply:</p><ul><li><p>Quantization</p></li><li><p>Buying your own chips instead of renting them</p></li><li><p>Distilling models</p></li><li><p><em>Building</em> your own chips</p></li></ul><p>With each optimization that makes your own process more efficient, you increase your margin. That&#8217;s great! You make more money per token. Right? Well, not quite. In a vacuum with a <a href="https://en.wikipedia.org/wiki/Spherical_cow">spherical cow</a>, you do. But just as you invest in your ability to serve tokens more efficiently, your competitors are all doing the same, eroding your margins.
To do a bad Ben Horowitz impersonation: <a href="https://genius.com/Jay-z-legacy-hova-lyrics">you run this hard just to stay in place</a>.</p><p>The necessary implication is that the undifferentiated LLM market will become a ruthless competition for efficiency, with companies competing to see who can demand the lowest return on invested capital.</p><p>In the classic business strategy book, <a href="https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma">The Innovator&#8217;s Dilemma</a>, lives the canonical example of how technological disruption happens (this is taken from <a href="https://www.newyorker.com/magazine/2012/05/14/when-giants-fail">the New Yorker profile</a> of the author, Clayton Christensen):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_1KW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40caa01-9f04-47cb-ac76-392b804cf46b_1498x868.png" alt=""></figure></div>
<p>In the world of steel manufacturing, historically, steel was made in massive integrated mills. They made high quality steel with reasonable margins. Then along came electric mini mills, which could make the lowest quality steel at a cheaper cost. The large steel manufacturers saw this, shrugged, and focused on making high quality steel at a (relatively) high margin. Over time, the electric mini mill operators figured out how to make higher and higher quality steel, moved upmarket, and killed the massive integrated mills (US Steel&#8212; once the 16th largest corporation by market cap in the US&#8212; was removed from the S&amp;P 500 in 2014).</p><p>The analogy to LLMs is straightforward. The large labs focus on making the highest performing models. They are excellent, and outperform every other model, but they are expensive. You need margin to pay for all of those <a href="https://www.levels.fyi/companies/openai/salaries/software-engineer">$900k engineers</a>! Even then, however, we see competition on price.
Gemini Pro is a case in point.</p><p>At the low end, we have the open source community, led by Meta and <a href="https://www.reddit.com/r/LocalLLaMA/">r/LocalLlama</a>, which is cranking out high quality models and figuring out how to serve them on <a href="https://finbarr.ca/how-is-llama-cpp-possible/">ridiculously low powered machines</a>. We should expect the open weight models to improve in quality and decrease in cost (on a quality adjusted basis), putting pressure on the margins of the largest labs. As a real-time example, <a href="https://www.together.ai/">Together</a> <a href="https://twitter.com/togethercompute/status/1734282721982324936">came out with a hosted version</a> of Mixtral that is 70% cheaper than Mistral&#8217;s own version.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!W8H1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0388f07a-71b8-4d4b-9cf3-02dfcf21adb1_1188x304.png" alt=""></figure></div>
<p>We should thus expect a bifurcated market. At the high end live more expensive, higher quality models, and at the low end, lower quality, less expensive models. For open weight models, we should expect their price to converge to the price of GPUs + electricity (and as <a href="https://www.semianalysis.com/p/amd-mi300-performance-faster-than">competition increases in the GPU market</a>, perhaps just to the price of electricity).</p><p>The question, then, is what does the buyer for these APIs look like? If we were to rank the economically valuable tasks that LLMs can perform from most complex to least complex, how many of the tasks require high end complexity? At some point there&#8217;s a threshold where GPT-4 is required, but it&#8217;s hard to imagine that the threshold will remain static. The open weight models will continue their inexorable climb up the list, biting at the margins of the large labs. As tooling makes it easier to effortlessly switch between model APIs, the developers using the API will switch to whatever the lowest cost model is that accomplishes their task. If you&#8217;re using an LLM for, say, <a href="https://github.com/features/copilot">short-length code completion</a>, do you need the biggest and best model? Probably not!</p><p>Moreover, the companies with the biggest success in the consumer marketplace will inevitably start to balk at paying a significant amount of their profits to another company, and will start to train their own models.
We see companies like <a href="https://www.harvey.ai/">Harvey</a> and <a href="https://cursor.sh/">Cursor</a>, which had some of the earliest access to GPT-4, starting to hire <a href="https://www.harvey.ai/careers/research-scientist">research scientists</a>/<a href="https://www.harvey.ai/careers/principal-staff-research-engineer">engineers</a>, giving them the talent required to train their own foundation models. As API fees are probably the biggest expense for these companies, it seems natural that they will do everything they can to reduce their costs as much as possible.</p><p>If you&#8217;re building your own models, you can go out and raise a round of investment, trading off a one-time capital expenditure to increase your overall margins. This is the justification for Google&#8217;s TPU program, for example. By spending billions of dollars on custom silicon, they&#8217;re able to avoid paying Nvidia&#8217;s Danegeld.</p><p>The conclusion, then, is that the market for LLM APIs will converge to one of lowest cost <em><strong>as long as your task is simple enough to be solved by open weight models</strong></em>. If your task is so complex that it requires the best model, you&#8217;re stuck paying OpenAI. For everyone else, there&#8217;s finetuned <a href="https://mistral.ai/news/announcing-mistral-7b/">Mistral 7B</a>.</p>]]></content:encoded></item><item><title><![CDATA[The evolution of the LLM API market]]></title><description><![CDATA[Note: if you&#8217;re coming to this post online, this is the same as the free post, I ran into issues opening this article up on Substack.]]></description><link>https://www.artfintel.com/p/the-evolution-of-the-llm-api-market</link><guid isPermaLink="false">https://www.artfintel.com/p/the-evolution-of-the-llm-api-market</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Tue, 12 Dec 2023 16:30:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_1KW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd40caa01-9f04-47cb-ac76-392b804cf46b_1498x868.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: if you&#8217;re coming to this post online, this is the same as the free post; I ran into issues opening this article up on Substack.</em></p>
      <p>
          <a href="https://www.artfintel.com/p/the-evolution-of-the-llm-api-market">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Transformer inference tricks]]></title><description><![CDATA[How to make your model run faster than a greased pig]]></description><link>https://www.artfintel.com/p/transformer-inference-tricks</link><guid isPermaLink="false">https://www.artfintel.com/p/transformer-inference-tricks</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 23 Nov 2023 16:26:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vZfa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a160c23-f817-4e48-a540-980d7e3df268_1308x560.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Transformer inference tricks</h1><p><em>Special thanks to <a href="https://x.com/cis_female?s=21">@cis_female</a> for discussing the intricacies of sparsity with me, and <a href="https://twitter.com/nostalgebraist?lang=en">@nostalgebraist</a> for correcting an error in the quantization section; I now think that the evidence shows that quantizing, at least to 4 bits or more, has a very minimal tradeoff in terms of performance.</em></p><p>I&#8217;m going to discuss a number of optimizations that can be done to make inference for transformers either faster or more efficient.</p><h1>KV Cache</h1><p>By far the most common (and most important) optimization for a decoder is a KV cache. In a decoder model, the keys and values for the prompt will be identical for every iteration of decoding. Moreover, once you&#8217;ve run a token through, its keys and values will be the same for every subsequent iteration. As a result, you can cache the prompt, and incrementally add the KV tensors for each token to the cache as they are decoded. Doing so removes a lot of compute. Inside the attention mechanism, we go from multiplying two tensors of shape (batch, context_length, feature_dim) to multiplying a query tensor of shape (batch, 1, feature_dim) with KV tensors of shape (batch, context_length, feature_dim). Consequently, sampling is no longer quadratic in complexity, allowing us to get decent decoding (sampling) performance with longer context lengths.</p><p>In practice, this causes added complexity inside your implementation, as you now have state rather than just running pure functions, and you have to keep running inference for the whole batch of sequences, even if one of them is done (see, e.g., the Google <a href="https://github.com/google/maxtext">MaxText implementation</a>).</p><p>The KV cache <a href="https://kipp.ly/transformer-inference-arithmetic/#kv-cache">requires</a> 2 * n_layers * n_heads * d_head parameters per token. For GPT-3, with n_layers = 96, n_heads = 96, and d_head = 128, this works out to 2.4M parameters for every token in your context. With typical 16-bit precision, that&#8217;s ~5MB per token; with a context window of 2048 tokens, that&#8217;s 10GB of HBM dedicated to your KV cache. Expensive, but not outrageous. And well worth every GB.</p>
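<p>As a sketch of the mechanics (a toy single-head version with illustrative shapes, not MaxText&#8217;s actual implementation): each decode step appends the new token&#8217;s K and V to the cache, and the lone new query attends over everything cached so far.</p><pre><code>import torch

def decode_step(q, k, v, cache_k, cache_v):
    # q, k, v: (batch, 1, d_head) projections for the newest token only.
    cache_k = torch.cat([cache_k, k], dim=1)  # (batch, t+1, d_head)
    cache_v = torch.cat([cache_v, v], dim=1)
    scores = q @ cache_k.transpose(1, 2) / cache_k.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ cache_v  # (batch, 1, d_head)
    return out, cache_k, cache_v
</code></pre>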
<p>These memory requirements are a big part of the reason why it&#8217;s so hard to use consumer grade GPUs for LLMs&#8212; the most powerful consumer card is the 4090, which has only 24GB of VRAM. It has FLOPS that are comparable to the enterprise grade chips, but the memory limits are much lower, making it difficult to fit the weights and the KV cache into memory.</p><h1>Speculative decoding</h1><p><a href="https://arxiv.org/abs/2211.17192">Speculative decoding</a> is a technique that is used when you have excess compute capacity, typically in the local inference setting. It exploits the property of modern accelerators whereby it takes roughly the same amount of time to run inference on a small batch of data as on a single datapoint. For an A100, for instance, you can run inference on up to 160 datapoints in the same amount of time as a single datapoint. As a result, many techniques have cropped up to exploit this, such as beam search, MCTS, and speculative decoding.</p><p>In speculative decoding, one has two models: a small, fast one, and a large, slow one. As the inference speed for a modern decoder is directly proportional to the number of parameters, a smaller model can run multiple inferences in the time it takes the large model to run a single inference.</p><p>Modern decoder models, like the GPT family, use autoregressive sampling: to sample a sequence of N tokens, the model runs inference N times, each time consuming the result of the previous inference.</p><p>In speculative decoding, you run the two models in parallel. The fast one runs a batch of inference and guesses which tokens the big model will predict, compounding these guesses into a draft sequence. In the meantime, the big model runs in the background, checking that the smaller model produced the same results. The small model can make many guesses in the time the big model makes one, but given that we have spare compute capacity, the big model is able to evaluate all of the guesses in parallel.
As such, the only place where we pay the sequential cost of generating a sequence is for the smaller model.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vZfa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a160c23-f817-4e48-a540-980d7e3df268_1308x560.png" alt=""></figure></div>
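<p>Here&#8217;s a minimal sketch of the greedy accept/reject loop (<code>draft_model</code> and <code>big_model</code> are hypothetical callables returning logits; the actual algorithm in the paper also handles non-greedy sampling and appends a bonus token from the big model at the first mismatch):</p><pre><code>import torch

def speculate(draft_model, big_model, tokens, k=4):
    draft = tokens
    for _ in range(k):  # k cheap, sequential draft steps
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)
    # One expensive call scores all k drafted positions in parallel;
    # the logits at position i predict the token at position i+1.
    verified = big_model(draft)[:, tokens.shape[1] - 1 : -1].argmax(-1)
    guesses = draft[:, tokens.shape[1]:]
    # Accept the longest prefix on which the big model agrees with the draft.
    n_accept = int((verified == guesses).long().cumprod(-1).sum())
    return draft[:, : tokens.shape[1] + n_accept]
</code></pre>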
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The major disadvantage to speculative decoding is that it requires a &#8220;draft&#8221; model that is able to predict the output of the larger model, and you have to have both models in memory on the same machine (or on the same node in a multi-GPU setting). This adds complexity, and requires additional work, as you have to be training two models (the original one, and the &#8220;draft&#8221; one). Moreover, any performance benefits are limited by how accurately the smaller model is able to predict the larger one. If the smaller model was consistently able to predict the behaviour of the larger model, we&#8217;d just use it! Consequently, there&#8217;s a fundamental gap in how well speculative decoding can perform. <a href="https://www.notion.so/Transformer-inference-tricks-c8c0bb5094684be49a50ba448f0cd9f6?pvs=21">HuggingFace</a> has claimed that it typically doubles the decoding rate, which is consistent with the original paper, which <a href="https://arxiv.org/abs/2211.17192">claimed a 2x-3x improvement</a>.</p><p>A <a href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/">technique</a> recently came out which tries to improve on this by having the model generate n-grams, and recursively match them, without requiring a draft model. There&#8217;s a technique called Jacobi decoding (figure taken from their blog) which is a potential improvement over greedy decoding. How it works is that, at every point where you generate a token, you generate <em>n</em> tokens, making a &#8220;guess&#8221; as to the entire sequence. Then, you verify this against your previous guess; if the two match, then you accept the guess. 
This can enable latency improvements with no downside, as in the worst case it devolves into greedy decoding.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kVbI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a05fd-266e-4f73-9592-6ae3fa38eb48_1376x606.png" alt=""></figure></div>
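<p>A rough sketch of a single Jacobi iteration (illustrative; the real method iterates this to a fixed point and batches the bookkeeping):</p><pre><code>import torch

def jacobi_step(model, prefix, guess):
    # Refine an n-token guess with one parallel forward pass.
    seq = torch.cat([prefix, guess], dim=-1)
    logits = model(seq)
    # The prediction for guess position i comes from the logits one step earlier.
    new_guess = logits[:, prefix.shape[1] - 1 : -1].argmax(-1)
    # Tokens where the new pass agrees with the previous guess are exactly what
    # greedy decoding would have produced, so they can be accepted.
    n_accept = int((new_guess == guess).long().cumprod(-1).sum())
    return new_guess, n_accept
</code></pre>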
<p>Lookahead decoding improves on this further by keeping the n-grams that have been generated during the decoding process and trying to use them as guesses. Given the high correlation between the text that has already been generated and the text that will be generated, this has the potential to improve latency dramatically, at minimal cost. It&#8217;s a very clever trick. I&#8217;m unaware of anyone using it, given that the technique was announced yesterday; I&#8217;m very curious to see how it performs in real world scenarios.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/e3885264-e7e0-4247-8eb1-14b53b084428_1430x796.png" alt="Lookahead decoding, figure from the lmsys blog post"></figure>
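<p>The n-gram pool at the heart of lookahead decoding can be sketched in a few lines (the names here are mine, not the lmsys implementation&#8217;s): every n-gram observed during decoding is indexed by its first token, and at each step the latest token fetches candidate continuations to verify in parallel, exactly as with a draft model&#8217;s guesses.</p><pre><code class="language-python">from collections import defaultdict

N = 3  # n-gram size
ngram_pool = defaultdict(set)

def record_ngrams(tokens):
    """Index every n-gram seen so far by its first token."""
    for i in range(len(tokens) - N + 1):
        gram = tuple(tokens[i : i + N])
        ngram_pool[gram[0]].add(gram[1:])

def candidate_guesses(last_token):
    """Continuations worth verifying, given the latest token."""
    return ngram_pool.get(last_token, set())

record_ngrams([5, 9, 2, 5, 9, 7])
print(candidate_guesses(5))  # {(9, 2), (9, 7)}: both get verified
</code></pre>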
<h1>Effective sparsity</h1><p>In a decoder transformer, the beating heart of the model is the attention mechanism, summarized in the attention equation:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/3d194e8d-6534-42a4-882a-7a2d8d7bfd63_940x174.png" alt="Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V"></figure>
<p>The softmax operation makes values that are <em>not</em> the max really small:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/391adbee-49cf-411d-b114-d578fd173fb9_848x308.png" alt="Illustration of how softmax pushes non-maximal values towards zero"></figure>
<p>Consequently, we are multiplying the values tensor (<strong>V</strong> in the attention equation) by a tensor that is mostly zeros. As a result, the output of the attention mechanism contains a lot of zeros&#8212; up to 97% (h/t <a href="https://x.com/YesThisIsLion/status/1647747069086666752?s=20">@yesthisislion</a>). Similarly, after each ReLU in the MLPs, we also have a lot of sparsity.</p><p>Now, unfortunately, it&#8217;s tough to actually make use of this. For sparsity in the weights, a lot can be done through structured sparsity (e.g. <a href="https://pytorch.org/docs/stable/sparse.html">torch.sparse</a>), but it&#8217;s not clear how well current systems can exploit <em>activation</em> sparsity.</p><p>One optimization that can be done: if an activation is zero, you can skip loading the weights that correspond to that activation, and skip the corresponding computation. This isn&#8217;t really supported in mainstream tensor programs as far as I can tell, but for a custom inference implementation, like, say, <a href="https://github.com/ggerganov/llama.cpp">Llama.cpp</a>, it would be easy to implement.</p>
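<p>For a single matrix-vector product, the idea looks something like this (a sketch of the trick, not Llama.cpp&#8217;s actual kernel):</p><pre><code class="language-python">import torch

def sparsity_aware_matvec(W, x):
    """Compute W @ x while skipping every column of W whose activation
    is zero, saving both the weight loads and the FLOPs. In the
    memory-bound decoding regime, the loads are what matter."""
    nz = torch.nonzero(x).squeeze(-1)  # indices of nonzero activations
    return W[:, nz] @ x[nz]            # only touch the columns we need

W = torch.randn(4, 8)
x = torch.randn(8)
x[torch.rand(8).lt(0.9)] = 0.0         # ~90% activation sparsity
assert torch.allclose(sparsity_aware_matvec(W, x), W @ x)
</code></pre>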
<p>The catch is that the activations are a function of each token, and thus so is the effective sparsity, which makes it effectively random across the tokens in a batch. As a result, the usefulness of this trick decays exponentially with batch size: if we have an effective sparsity of X and a batch of size N, the likelihood that a given activation will be zero across the entire batch is X^N. I have a table for varying values of X and N. The decay is dramatic!</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2cca5434-aec4-4056-827c-fb1ccf5df009_1252x706.png" alt="Table of X^N for varying sparsity levels X and batch sizes N"></figure>
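<p>The table itself is an image, but since it&#8217;s just X^N, the numbers are easy to regenerate:</p><pre><code class="language-python"># Probability that a given activation is zero for *every* sequence
# in the batch, as a function of sparsity X and batch size N.
for X in (0.99, 0.97, 0.90):
    row = "  ".join(f"N={N}: {X**N:.3f}" for N in (1, 2, 4, 8, 16, 32))
    print(f"X={X}: {row}")
# Even at 97% sparsity, a batch of 32 leaves only a ~0.38 chance of
# being able to skip any given weight load.
</code></pre>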
<p>As a result, it&#8217;s tough to make use of this outside of the batch-size-1 regime, and even there, it&#8217;s typically more useful to spend your effort on speculative decoding. But if you&#8217;re trying to run inference locally and really need to get your latency down, this can be a great trick.</p><h1>Quantization</h1><p>Quantization is one of the better known tricks. <a href="https://finbarrtimbers.substack.com/p/efficient-llm-inference">I wrote about it before</a>, so I&#8217;m not going to spend a ton of time on the actual methods. It&#8217;s tough to quantify how well quantization works.
Much of the literature, such as the GPTQ paper, was done with models that aren&#8217;t close to SOTA, as the big labs aren&#8217;t publishing and academics can&#8217;t match the resources that the big labs have.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/23750e42-cdd0-4ae7-8f34-c855164090e5_1094x472.png" alt=""></figure>
<p>For instance, GPTQ reported results quantizing the OPT &amp; BLOOM models, which are <em>much</em> worse than the current crop of open source models, let alone GPT-4.</p><p>Of course, the big labs aren&#8217;t reporting what they&#8217;re doing, and most of the anecdotal reports I&#8217;ve seen are from people who are trying to run smaller models on consumer grade hardware, which is extremely memory limited. I think that a lot of hobbyists (i.e. people who don&#8217;t work as researchers at big labs) are blinded by the appeal of running a really big model locally, so they get really excited about quantization. But there&#8217;s no intrinsic advantage to quantization! From a first principles perspective, if you have two models that use the same number of bits, they should have the same number of tokens/s and a similar level of performance. There would only be a big difference if we were doing a terrible job of using the bits in higher precision formats.</p><p>The literature doesn&#8217;t agree with my intuition&#8212; the aforementioned GPTQ paper found a negligible decrease in performance from quantizing models to up to 4x lower precision. I think one explanation is that it&#8217;s much easier to quantize worse models without sacrificing performance. If we consider two identical LLMs, one trained on 2 trillion tokens and one trained on 500B tokens (call them LLM-2T and LLM-500B), I think we should expect the one trained on more tokens to suffer more when quantized, as it should be making better use of its weights. We&#8217;d still expect the quantized LLM-2T to be better than LLM-500B, but I expect the performance decrease to be bigger from LLM-2T to quantized LLM-2T than from LLM-500B to quantized LLM-500B.</p><p><em>Note: While I find the above argument compelling, it&#8217;s not at all supported by the literature. Quantizing does appear to be pretty darn close to a free lunch.</em></p><p>More recent work, like the <a href="https://arxiv.org/abs/2212.09720"><em>k</em>-bit inference scaling laws paper</a>, ran an incredible number of experiments across a family of LLM architectures, reporting how allocating your bits differently affects performance.
They studied the tradeoff between having a model with N parameters at a given level of precision vs. a model with 2N parameters at half the precision. Their results were pretty compelling, showing almost no penalty for quantizing (at least for 4 or more bits):</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/186498d3-e8f4-4567-b7c6-008f2b6ffeaf_1024x1038.png" alt=""></figure>
<p>They found, basically, that you can go down to 4 bits without any penalty. There is almost no tradeoff from quantizing! You can run a 4x smaller model without a significant drop in performance. And since inference throughput on modern accelerators scales inversely with the number of bits you process (i.e. you can get Nx more operations per second using Nx less precision), this is great.</p><p>My conclusion then, such as it is, is to use the recommendations from the k-bit inference paper. However, I&#8217;m hesitant to recommend using precision lower than 8-bit for production workloads. fp8 is the lowest-precision floating point format that is supported natively by modern accelerators, and even then, <a href="https://discuss.pytorch.org/t/fp8-support-on-h100/190236">support is limited</a>. I would train and run inference in fp8, and see if the tradeoff in accuracy from quantizing further is acceptable for your usecase. I would struggle to recommend running a lower level of precision in a production environment when it doesn&#8217;t have native support from the platforms (i.e. Nvidia and the Torch/JAX teams).</p><p>As far as I can tell from the literature (which matches my intuition), fp8 is strictly better than int8, but it has limited support in hardware. If you&#8217;re at a <a href="https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini">GPU rich organization</a> and get to use H100s for everything, use fp8. Otherwise, int8 is fine, and is much easier to use&#8212; <a href="https://pytorch.org/docs/stable/quantization.html">PyTorch makes it quite easy</a> (although the APIs are unstable).</p><p>When it comes to actually quantizing your model, the PyTorch team has a writeup of how to <a href="https://pytorch.org/blog/accelerating-generative-ai/">actually do this</a>, and they provide a <a href="https://github.com/pytorch-labs/ao/tree/main#torchao">bunch of APIs</a> to make it easy, although they&#8217;re unstable. <code>bitsandbytes</code> is another excellent library for quantization, although I haven&#8217;t used it personally.</p>
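<p>As a concrete starting point, here&#8217;s what int8 post-training quantization looks like with PyTorch&#8217;s dynamic quantization API (a minimal sketch using the stable entry point, not the torchao recipe from the writeup above):</p><pre><code class="language-python">import torch

# Dynamic quantization: Linear weights are stored in int8, and
# activations are quantized on the fly at inference time.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(qmodel(x).shape)  # same interface, ~4x smaller Linear weights
</code></pre>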
]]></content:encoded></item><item><title><![CDATA[Why do LLMs use greedy sampling?]]></title><description><![CDATA["Greedy sampling is the worst form of sampling, except all those other forms that have been tried from time to time." - Winston Churchill, if he worked in NLP.]]></description><link>https://www.artfintel.com/p/why-do-llms-use-greedy-sampling</link><guid isPermaLink="false">https://www.artfintel.com/p/why-do-llms-use-greedy-sampling</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Tue, 17 Oct 2023 15:37:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Bp9Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1e25adb-7991-427f-8368-09404364efe3_1242x500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Why do LLMs use greedy sampling?</h1><p><em>This is a speculative article, so I&#8217;d greatly appreciate any <a href="mailto:finbarrtimbers@gmail.com">feedback</a>, particularly if you disagree.</em></p><p>When I began working on LLMs, I found it pretty surprising that the SOTA in generative text was to greedily sample from the bare outputs of the neural networks. In other words, with GPT-style models, the typical approach to generate a sequence of text is something like the following (see the sketch after this list):</p><ol><li><p>Run your prompt through your model and generate probabilities over your vocabulary.</p></li><li><p>Choose the most likely token (perhaps with some randomization, maybe with some preprocessing, like top-k or nucleus sampling).</p></li><li><p>If the chosen token is <code>&lt;|endoftext|&gt;</code>, you&#8217;re done; otherwise, concatenate the new token to your prompt and go back to 1.</p></li></ol>
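<p>In code, the loop is a few lines (<code>model</code> here is a stand-in for any GPT-style callable mapping token ids to per-position logits):</p><pre><code class="language-python">import torch

def greedy_decode(model, prompt_ids, eot_id, max_new_tokens=256):
    """The three steps above, in order: score, pick argmax, stop/append."""
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)[-1]             # 1. next-token scores
        next_tok = logits.argmax().reshape(1)  # 2. most likely token
        if next_tok.item() == eot_id:          # 3a. done at &lt;|endoftext|&gt;
            break
        tokens = torch.cat([tokens, next_tok])  # 3b. append and repeat
    return tokens
</code></pre>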
<p>In games RL research, it is common to instead conduct a much more complicated calculation to choose the next step in your sequence. For instance, <a href="https://web.stanford.edu/~surag/posts/alphazero.html">AlphaZero uses a somewhat complicated algorithm called Monte Carlo Tree Search (MCTS)</a>. Here, I explore some reasons why LLMs <em>don&#8217;t</em> use a fancier decoding algorithm.</p><p><em>But first, a caveat: there&#8217;s a lot of literature proposing various ways to do this that I&#8217;m not going to engage with, for sake of time. I have a list of references at the end which I&#8217;d encourage you to look at if you want a more detailed look.</em></p><p>The current paradigm of language modelling, with GPT-style decoder models, uses greedy autoregressive sampling to generate a sequence of tokens. This is a somewhat surprising choice; if you look at the history of NLP research, particularly the Neural Machine Translation literature, <a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a> is often needed to reach SOTA performance (e.g. <a href="https://arxiv.org/abs/1703.03906">1703.03906</a>). Similarly, in games research, search is typically many times stronger than any pure neural network approach, and search will strictly dominate wherever it&#8217;s feasible (the exceptions are games like <a href="https://www.deepmind.com/blog/mastering-stratego-the-classic-game-of-imperfect-information">Stratego</a>, where the game tree has much too high a branching factor to be searched to any non-trivial depth). In games like <a href="https://arxiv.org/abs/1712.01815">Go, Chess</a>, <a href="https://arxiv.org/abs/2112.03178">Poker, or Scotland Yard</a>, search methods dominate.</p><p>By <strong>search</strong>, I am referring to algorithmic search, which I define as any method that uses additional compute at inference time to improve the answer. This has nothing to do with Google-style search (which I call &#8220;information retrieval&#8221;).</p><p>So why don&#8217;t GPTs use search? Well, there are a few answers to this. The first one is a total copout:</p><ol><li><p>GPTs don&#8217;t use search <em><strong>as far as we know</strong></em>. OpenAI recently raised my eyebrows when they hired <a href="https://twitter.com/polynoamial?lang=en">Noam Brown</a>, an expert on search in games, to work on &#8220;multi-step reasoning, self-play, and multi-agent AI.&#8221; That sounds an awful lot like search to me (and, specifically, sounds a lot like Alpha/MuZero). We also know that Demis has talked about <a href="https://www.wired.com/story/google-deepmind-demis-hassabis-chatgpt/">Gemini incorporating techniques</a> from AlphaGo, which, again, makes me think of search (not to mention self-play).</p><p>So it&#8217;s entirely possible that search is the secret sauce behind GPT-4&#8217;s performance, and the lack of it is why the open source world has been unable to match it. I&#8217;m suspicious of this&#8212; for reasons I&#8217;ll get into below&#8212; but if I were actively working on LLM research, I&#8217;d be focusing on trying to use search.</p></li></ol><p>Let&#8217;s assume, then, that the key players (GPT-4, Claude, etc.) aren&#8217;t using search. Why?</p>
      <p>
          <a href="https://www.artfintel.com/p/why-do-llms-use-greedy-sampling">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[More on Mixture of Experts models]]></title><description><![CDATA[6 papers on different routing mechanisms]]></description><link>https://www.artfintel.com/p/more-on-mixture-of-experts-models</link><guid isPermaLink="false">https://www.artfintel.com/p/more-on-mixture-of-experts-models</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 07 Sep 2023 14:38:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hORB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a follow up article to <a href="https://finbarrtimbers.substack.com/p/papers-ive-read-this-week-mixture">one I wrote</a> at the beginning of August. That article was a discussion of Mixture of Experts models, and dove into how they work at a high level. After writing it, I had a lot of people suggest more MoE papers to read, so I decided to do that. This article is a summary of 6 papers that explore different routing techniques and cover a broad swath of the research landscape.</p><h1>Routing techniques</h1><p>There are several broad families of routing techniques:</p><ol><li><p>Fully differentiable techniques, where the routing layer is but one layer in the network (although typically much smaller than the actual transformer blocks),</p></li><li><p>Linear programming techniques, where the routing layer solves an isolated optimization problem, and</p></li><li><p>Non-parametric approaches, e.g. hashing, which don&#8217;t solve or learn anything. These are mostly used for benchmarking.</p></li></ol><p>Most research that focuses on high performance (as in, models that are good at modelling language) focuses on either #1 or #2. The trend recently has been towards fully differentiable techniques, but it&#8217;s not clear to me how much that&#8217;s supported by the results, vs. being a decision driven by the bias most researchers have towards end-to-end learning (which, itself, is supported by research results from other domains, such as deep RL).</p><p>There are six papers that I will discuss in this post:</p><ol><li><p><a href="https://arxiv.org/abs/1308.3432">RL routing</a>, which focuses on the underlying theory allowing us to back-propagate through stochastic neurons, but which doesn&#8217;t propose a specific method. This is included because it&#8217;s cited by most subsequent papers.</p></li><li><p><a href="https://arxiv.org/abs/2106.04426">Non-parametric hash layers</a>, which randomly come up with a fixed token &#8596; expert assignment once, and use that indefinitely.
The hash layers serve as a strong baseline, but probably shouldn&#8217;t be used in production.</p></li><li><p><a href="https://proceedings.mlr.press/v139/lewis21a.html">BASE</a>, which solves a linear assignment problem to allocate tokens to experts.</p></li><li><p><a href="https://arxiv.org/abs/2106.03760">Differentiable Select-K</a>, which is a fully differentiable version of the standard <a href="https://arxiv.org/abs/1701.06538">sparsely gated MoE layer</a>.</p></li><li><p><a href="https://arxiv.org/abs/2202.09368">Expert choice routing</a>, which reverses the standard &#8220;token choice&#8221; framework where tokens choose the top expert for them to be routed to, and instead has each <strong>expert</strong> choose the top K tokens for it to see.</p></li><li><p><a href="https://arxiv.org/abs/2306.03745">Soft MoE</a>, which sends all tokens to every expert and combines the results with weights which are learned on a per-token basis.</p></li></ol><p>This isn&#8217;t an exhaustive literature review, but is an attempt at summarizing the papers that one would want to read to get close to the state of the art if one were planning to implement a MoE system. If you haven&#8217;t read my <a href="https://finbarrtimbers.substack.com/p/papers-ive-read-this-week-mixture">previous MoE post</a>, I&#8217;d encourage you to do that first.</p><h2>RL routing</h2><p><a href="https://arxiv.org/abs/1308.3432">Abstract</a></p><p>This paper is <em>ancient</em> (it&#8217;s from 2013!) but provides a lot of the background for using RL to route MoE models. It examines the question of how to estimate the gradient of a loss function with respect to the input of stochastic/hard non-linear neurons, as we can&#8217;t naively backprop through them. This is a problem which arises in the MoE work&#8212; see, for instance, the soft MoE paper later on, which has a solution to this as a key selling point.</p><p>The core idea is that it is possible to obtain estimated gradients by introducing small random perturbations in the system and observing the effects&#8212; the definition of the derivative!</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f&#8217;(x) := \\lim_{h\\to 0} \\frac{f(x + h) - f(x)}{h}&quot;,&quot;id&quot;:&quot;CFXRUDSGWQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can numerically approximate this! Doing this naively is inefficient, as independently perturbing N parameters is Nx more expensive, but if we randomly introduce stochasticity in a way that lets gradients sometimes flow, we can backprop in the standard way and adjust the neurons accordingly. The canonical example is <a href="https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">dropout</a>&#8212; it also introduces a hard non-linearity with some probability, but we can still backprop through it without issue when the parameter is not dropped out.</p>
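<p>The perturbation idea in its simplest form looks like this (a sketch of the definition above, not of the estimators the paper actually proposes):</p><pre><code class="language-python">import torch

# Estimate df/dx by nudging the input and watching the output.
f = lambda x: torch.sigmoid(3 * x)
x, h = torch.tensor(0.5), 1e-4
grad_est = (f(x + h) - f(x)) / h
print(grad_est)  # ~0.447, matching the analytic 3 * sigmoid'(1.5)
</code></pre>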
<p>In the paper, they consider equations of the form</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_i = f(a_i, z_i),&quot;,&quot;id&quot;:&quot;LYMPPPUZLY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>a</em>_i is an input, typically the output of other neurons, and <em>z</em>_i is a random variable. They consider 3 options:</p><ol><li><p>The noisy rectifier:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_i=\\max(0,z_i+a_i), \\quad E[z_i]=0&quot;,&quot;id&quot;:&quot;MNZBCDWGTP&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p>The STS (Stochastic Times Smooth) unit:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_i=b_i \\cdot p_i, \\quad p_i=f(a_i), \\quad b_i \\thicksim \\text{Binomial}(p_i),&quot;,&quot;id&quot;:&quot;ITEKZXYAJD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>f</em> is some activation function (they use the sigmoid), and</p></li><li><p>The Straight-Through (ST) Estimator, where you back-propagate through the hard threshold function as if it were the identity. For instance, if we have the function</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_i=1_{z_i>\\text{sigmoid}(a_i)}&quot;,&quot;id&quot;:&quot;NQBPOTAMSC&quot;}" data-component-name="LatexBlockToDOM"></div><p>then using the ST estimator we&#8217;d set the gradient as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_i= \\frac{\\partial L}{\\partial h_i}&quot;,&quot;id&quot;:&quot;VAURGQIIZV&quot;}" data-component-name="LatexBlockToDOM"></div></li></ol>
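<p>In PyTorch, the ST trick is one line: compute the hard sample in the forward pass, but route the gradient through the smooth probability. A minimal sketch, using the common Bernoulli(sigmoid(a)) formulation:</p><pre><code class="language-python">import torch

def st_bernoulli(a: torch.Tensor) -&gt; torch.Tensor:
    """Sample a hard 0/1 value, but backprop as if it were sigmoid(a)."""
    p = torch.sigmoid(a)
    h = torch.bernoulli(p)
    # Forward value is h; the detach() makes the gradient flow via p.
    return p + (h - p).detach()

a = torch.randn(4, requires_grad=True)
st_bernoulli(a).sum().backward()
print(a.grad)  # nonzero: gradients flow despite the hard sample
</code></pre>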
<p>The ST estimator works quite well in practice:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/972928ca-e3cd-4a0a-908c-922766722ad4_602x262.png" alt=""></figure><p>It significantly outperforms the other stochastic methods when used to train a model to classify MNIST.
It&#8217;d be nice to see more experiments validating these methods; MNIST isn&#8217;t particularly compelling now, although this does date back to 2013.</p><p>The paper is interesting mostly for context into how we can actually train routing layers for MoE models.</p><h2>Non-parametric Hash layers</h2><p><a href="https://arxiv.org/abs/2106.04426">Abstract</a></p><p>This is a paper I&#8217;m revisiting, as I had previously read it while reading <a href="https://arxiv.org/abs/2202.01169">Unified Scaling Laws for Routed Language Models</a>. Intuitively, it doesn&#8217;t make sense to me that a non-parametric approach would perform comparably to a learned approach, so I find this paper compelling, as it demonstrates that a simple non-parametric model can approach the performance of learned models. This mirrors my experience as an AI poker researcher: UniformRandom (literally just picking a random bet) was a surprisingly hard benchmark for our learned agents to beat, and in certain subgames of e.g. <a href="https://en.wikipedia.org/wiki/Scotland_Yard_(board_game)">Scotland Yard</a>, was optimal. I continually found this fact surprising.</p><p>The method itself is quite simple:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png" alt="Hash layer routing diagram"></figure>
srcset="https://substackcdn.com/image/fetch/$s_!hORB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png 424w, https://substackcdn.com/image/fetch/$s_!hORB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png 848w, https://substackcdn.com/image/fetch/$s_!hORB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png 1272w, https://substackcdn.com/image/fetch/$s_!hORB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b0b591-144f-414d-bf71-594bf9dd12d8_978x342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In it, the tokens are hashed into a fixed number of buckets, each corresponding to an expert. The routing function uses the original input token rather than the hidden state, so there is no learning happening in the routing layer. As a result, the routing mechanism is fast, deterministic, and robust.</p><p>The authors consider a wide variety of hashing functions:</p><ul><li><p>Random Hash, where they build a lookup table assigning every token to a fixed, random expert.</p></li><li><p>Balanced assignment, in which they build a lookup table which greedily assigns the most frequent tokens to the emptiest buckets. 
This is more balanced than Random Hash, but still not perfectly balanced.</p></li><li><p>Bigram Hash, which hashes the current and previous tokens.</p></li><li><p>Previous Token Hash, which uses the previous token only.</p></li><li><p>Position Hash, which hashes based on the position in the sequence.</p></li><li><p>Oracle Future Hash, used as a baseline, which hashes the <strong>next</strong> token.</p></li><li><p>Predicted Future Token Hash, which predicts the next token and hashes over the prediction.</p></li><li><p>Clustered Hashes, in which they obtain clusters using <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a> over token embeddings from a baseline model, and assign experts to clusters.</p></li><li><p>Dispersed Hashes, which assign similar tokens to different buckets by using the same k-means clusters as Clustered Hashes, but distributing the tokens within each cluster equally across buckets.</p></li></ul><p>A MultiHash layer is also used, in which the authors take N hashing functions, split the token embedding into N segments, and use each function to allocate its embedding segment to a different expert.</p>
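<p>To make this concrete, here is a minimal sketch of a Random Hash routing layer in PyTorch. This is my own illustration, not the paper's code; the vocabulary size, expert architecture, and dimensions are placeholder assumptions:</p><pre><code>import torch
import torch.nn as nn

class HashRouterMoE(nn.Module):
    """Routes each token to a fixed expert based on its token id.

    The lookup table is built once at init and never trained, mirroring
    the Random Hash variant described above.
    """

    def __init__(self, vocab_size: int, n_experts: int, d_model: int, d_ff: int):
        super().__init__()
        # Fixed, random token -> expert lookup table (a buffer, not a parameter).
        self.register_buffer("bucket", torch.randint(0, n_experts, (vocab_size,)))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, token_ids: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); h: (batch, seq, d_model).
        out = torch.zeros_like(h)
        expert_ids = self.bucket[token_ids]  # routing ignores h entirely
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(h[mask])
        return out
</code></pre><p>The key property is that routing depends only on the token id, never on the hidden state, so there is no router to train and no auxiliary load-balancing loss.</p>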
srcset="https://substackcdn.com/image/fetch/$s_!B8Bd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28c3661-a594-4bc5-a3f7-972b0a105813_1214x500.png 424w, https://substackcdn.com/image/fetch/$s_!B8Bd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28c3661-a594-4bc5-a3f7-972b0a105813_1214x500.png 848w, https://substackcdn.com/image/fetch/$s_!B8Bd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28c3661-a594-4bc5-a3f7-972b0a105813_1214x500.png 1272w, https://substackcdn.com/image/fetch/$s_!B8Bd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28c3661-a594-4bc5-a3f7-972b0a105813_1214x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Their ablations also reveal that the sparse models they try (Hash/Switch) outperform the dense Baseline model they train, as well as the Wider Transformer (755M params). But the Deeper Transformer (also 755M) they train is better than the sparse models. This is interesting; a bunch of the scaling laws paper found that network architecture didn&#8217;t really matter, which isn&#8217;t consistent with my experience, so it&#8217;s nice to see that validated in the experiments here.</p><p>They also compare to BASE layers (which we will discuss later in this article), as shown in the plot on the right, and show an improvement. I&#8217;m surprised by this; I would have expected BASE layers to do strictly better, as they&#8217;re solving a linear programming problem. My suspicion is that the expert embeddings they use aren&#8217;t very good. The Hash Layers are significantly more performant than the BASE layers, as BASE requires two all-to-all communications, which are expensive.</p><p>In their ablations, they find that increasing the number of routing modules increases the advantage that Hash has over Switch. 
Their hypothesis is that with a small number of routers, learning to route is more important, but this advantage goes away quickly.</p>
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They experiment with all of the different hashing methods, and find that Balanced Assignment is the best (other than the Future token Oracle):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pc2P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pc2P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 424w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 848w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 1272w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pc2P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png" width="1206" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1206,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197932,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Pc2P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 424w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 848w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 1272w, https://substackcdn.com/image/fetch/$s_!Pc2P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde1e1732-d99d-4ac4-95dd-9d3a83f99477_1206x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The other one that is roughly as good is Dispersed Hash, which randomizes tokens within each cluster across all the experts, making it effectively a more complicated randomization. 
This makes sense; from what I've seen reading routing papers, the most important characteristic is that tokens are balanced across experts, so that all the experts are trained well.</p><p>They also run a comparison on Wikitext-103, finding that with a smaller BPE dictionary, the Hash layer does better than a Switch Transformer.</p>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I suspect this is due to token balancing (or rather, unbalanced tokens). With a smaller BPE dictionary, I suspect the switch transformer struggles to balance the tokens, which the hash layer does by construction.</p><p>When they add in multiple hash functions, performance slightly increases, but not at a level that seems worth the complexity to me:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zl_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zl_P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 424w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 848w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 1272w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zl_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png" width="1210" height="368" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdce456b-3589-4035-bb88-923cdf153836_1210x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zl_P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 424w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 848w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 1272w, https://substackcdn.com/image/fetch/$s_!zl_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdce456b-3589-4035-bb88-923cdf153836_1210x368.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Generally, I&#8217;m surprised by this paper. It&#8217;s remarkable how well the method does. 
This should be used as a baseline for routing papers.</p><h2>BASE</h2><p><a href="https://proceedings.mlr.press/v139/lewis21a.html">Abstract</a></p><p>This paper introduces an algorithm called balanced assignment of experts (BASE), which formulates the problem of allocating tokens to experts as a <a href="https://en.wikipedia.org/wiki/Assignment_problem">linear assignment problem</a>. This allows them to use linear programming to find an optimal solution which guarantees that each expert receives an equal number of tokens. By using a classical optimization approach, the method doesn't introduce any new hyperparameters or auxiliary losses, and thus doesn't add any complication to training. An implementation was released in <a href="https://github.com/facebookresearch/fairseq">Fairseq</a>, FAIR's sequence modeling toolkit.</p><p>In BASE, a single expert is assigned per token, in a way that maximizes token-expert similarity. Expert specialization is learned by training a modified residual connection that mixes in each expert.</p><p>During training, the authors maximize model throughput by assigning an equal number of tokens to each expert. At test time, they assign each token to its highest-scoring expert ("tokens choice"). They solve the following problem to assign tokens to experts (where we have <em>T</em> tokens, <em>h_t</em> is the representation of token <em>t</em>, we have <em>E</em> experts each with an associated embedding <em>w_e</em>, and we have an assignment index <em>0 ≤ a_t &lt; E</em> assigning tokens to experts):</p><p>$$\max \sum_{t=1}^{T} h_t \cdot w_{a_t} \quad \text{subject to} \quad \forall e: \sum_{t=1}^{T} \mathbb{1}_{a_t = e} = \frac{T}{E}$$</p><p>This objective maximizes the similarity of the token and expert embeddings while respecting the constraint that all experts receive an equal number of tokens; a toy solver is sketched below.</p><p>To minimize the computational cost of this problem, which would otherwise require solving for all E·T tokens across all the workers at once, the problem is decomposed by having each worker solve a separate assignment problem over the inputs it receives; each worker then sends <em>T/E</em> tokens to each other worker. However, the inputs a worker receives are heavily correlated, because the tokens assigned to each worker during training are typically from the same domain. To enable specialization, they add a random routing step, where each worker first sends an equal number of tokens to each other worker at random. In effect, the algorithm has three steps during training:</p><ol><li><p>Each worker sends T/E tokens to each other worker randomly.</p></li><li><p>Each worker solves a separate linear assignment problem over the tokens it received.</p></li><li><p>Each worker routes <em>T/E</em> tokens to each expert according to that assignment.</p></li></ol><p>At test time, the workers simply assign each token to its best expert.
The hope is that the workers have learned a reasonably balanced assignment.</p>
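<p>To make the objective concrete, here is a toy version of the balanced assignment step using scipy's Hungarian solver. This is my own sketch, not the paper's implementation (which has to solve the problem at scale, across workers); the shapes are placeholders:</p><pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assign(h, w):
    """Assign T tokens to E experts, exactly T/E tokens per expert.

    h: (T, d) token representations; w: (E, d) expert embeddings.
    Maximizes sum_t h_t . w_{a_t} subject to the equal-load constraint.
    """
    T, E = h.shape[0], w.shape[0]
    assert T % E == 0, "sketch assumes T divisible by E"
    scores = h @ w.T  # (T, E) token-expert similarities
    # Replicate each expert column T/E times so the capacity constraint
    # becomes a plain square (T x T) assignment problem.
    cost = -np.repeat(scores, T // E, axis=1)
    rows, cols = linear_sum_assignment(cost)
    a = np.empty(T, dtype=int)
    a[rows] = cols // (T // E)  # map replicated columns back to experts
    return a

# 8 tokens, 4 experts: each expert receives exactly 2 tokens.
rng = np.random.default_rng(0)
a = balanced_assign(rng.normal(size=(8, 16)), rng.normal(size=(4, 16)))
print(np.bincount(a, minlength=4))  # [2 2 2 2]
</code></pre><p>Replicating each expert column T/E times is a standard trick for turning a capacity-constrained assignment into a plain square one.</p>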
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They also find that their approach matches the Sparsely Gated MoE model despite being simpler.</p><p>When it comes to compute efficiency, the best is (unsurprisingly) data parallel, but BASE is the second fastest approach due to the lower amount of communication needed between workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1FK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1FK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 424w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 848w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 1272w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1FK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png" width="696" height="478" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:696,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81136,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S1FK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 424w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 848w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 1272w, https://substackcdn.com/image/fetch/$s_!S1FK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14b85f6-1106-49fa-ac92-dd0496e7bb20_696x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Routing layers that solve linear programming problems, like BASE, seem like a strong approach. They seem to have fallen out of fashion, which I don&#8217;t understand. I think that more people deploying MoE models would do well to consider LP-based approaches.</p><p>Note: This paper was built on with the <a href="https://arxiv.org/abs/2202.01169">Unified Scaling Laws for Routed Language Models</a> paper, which proposed a variant: Sinkhorn-BASE. I won&#8217;t discuss it here, but it&#8217;s worth reading. 
It has a better matching step than BASE and, as a result, slightly improved performance.</p><h2>Differentiable Select-K</h2><p><a href="https://arxiv.org/abs/2106.03760">Abstract</a></p><p>This is a continuously differentiable sparse gate for MoE routing. The standard MoE routing problem is to select the best <strong>k</strong> out of <em>n</em> experts for each token. This is a constrained optimization problem, and one that is a poor fit for the accelerators that handle contemporary ML workloads. This paper reformulates the problem as an equivalent unconstrained one, using a binary encoding scheme to implicitly enforce the cardinality constraint. Thanks to the binary encoding, the number of parameters used by DSelect-k is logarithmic in the number of experts, while existing gates (e.g. the <a href="https://arxiv.org/abs/1701.06538">Sparse MoE gate</a>) are linear. This could be useful with techniques like Soft MoE (discussed later) which have a massive number of experts.</p><p>The authors propose two varieties: <em>per-example</em>, in which each token chooses an expert, and <em>static</em>, in which a weighting of experts is chosen once and does not vary per input. The static routing is not standard and is rarely used in the literature; it is more analogous to an ensembling technique. The only other time I've come across a similar setup is in the random hashing paper.</p><p>The authors compare DSelect-k to Top-k. The Top-k gate is defined by</p><p>$$\sigma(\text{TopK}(Ax + b, k)),$$</p><p>where the TopK function is equal to the identity for the top <em>k</em> elements, and −∞ for the rest. While not continuous, Top-k allows for gradient propagation through the TopK outputs (using the "straight-through" method introduced in the RL routing paper discussed earlier).</p>
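<p>A minimal PyTorch rendering of that gate, as I read the definition above (not the paper's code):</p><pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Top-k gate: keep the top-k logits, set the rest to -inf, then
    softmax, so exactly k experts receive nonzero weight."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)  # computes Ax + b
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.proj(x)  # (..., n_experts)
        top = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(
            -1, top.indices, top.values
        )
        # Gradients flow only through the k selected logits; the hard
        # selection itself is the discontinuity DSelect-k removes.
        return F.softmax(masked, dim=-1)
</code></pre>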
<p>The authors conducted an auxiliary experiment to show why continuity is a desirable property. In it, they used a MoE model to generate synthetic data, and trained routing layers to learn which experts were used to generate which data. They find that DSelect-k is able to recover the true experts, while Top-k is not; additionally, the weights chosen by DSelect-k are much better behaved, while those from Top-k exhibit a strange oscillatory behaviour, which the authors attribute to the discontinuous nature of the Top-k router.</p>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The authors conduct training on the MovieLens dataset, with two tasks: 1) a binary classification problem predicting whether a user watches a particular movie, and a regression problem to predict the rating a user assigns a given rating. They plotted the expert weights during training, and TopK has much higher variance in the weights, where the weight assigned to a given expert abruptly changes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RlUJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RlUJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 424w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 848w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 1272w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RlUJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png" width="1272" height="502" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c68d7d90-f290-459e-843c-ab65abed788d_1272x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:502,&quot;width&quot;:1272,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219492,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RlUJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 424w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 848w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 1272w, https://substackcdn.com/image/fetch/$s_!RlUJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68d7d90-f290-459e-843c-ab65abed788d_1272x502.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What I find interesting about these results is that these auxiliary metrics show a large degree of variance, which one would expect to be quite harmful in the final performance. However, DSelectK does not radically differ in final performance compared to the other routing functions.</p><p>That surprises me; given the high degree of variance induced by the discontinuous nature of the function, I&#8217;d expect to see more instability. 
Perhaps this is because the authors don't train an LLM; it would be interesting to see an ablation where they do. I would expect fewer loss spikes and more stability during training from a smoother function like DSelect-k. But that would be prohibitively expensive, as all things LLM are.</p><h2>Expert choice routing</h2><p><a href="https://arxiv.org/abs/2202.09368">Abstract</a></p><p>A one-sentence summary of this paper would be: "each expert learns to greedily select its top-k tokens."</p><p>Load imbalance is a major problem in MoE models. Often, we see the top experts get more tokens than the rest, which tends to compound over time. As a result, many papers dedicate a lot of time and effort to balancing the load across experts.</p><p>This paper seeks to address that by having each expert select the top-K tokens it sees, rather than having tokens select the top-K experts, as has previously been done.</p><p>This has the advantage that tokens can be seen by a variable number of experts. Their model outperforms the T5 dense model in 7/11 tasks; I would like to see stronger performance, as my prior is that a model that is exactly as good as T5 would outperform it in 5.5/11 tasks (50%), and 7 is not significantly larger than 5.5.</p><p>Generally speaking, I'm suspicious when any method demonstrates an improvement over an existing method, because most researchers tend to tune their new method better than the old one. This is human nature; "<a href="https://www.goodreads.com/quotes/21810-it-is-difficult-to-get-a-man-to-understand-something">it is difficult to get a man to understand something, when his salary depends on his not understanding it</a>." As such, I want to see a new method be head and shoulders better than the previous one if the authors are claiming an improvement. My bias is that, with access to the number of GPUs that a typical Google researcher has, I could show an improvement in most deep learning techniques simply by doing more hyperparameter tuning. If I could also add an arbitrary new technique that introduced more hyperparameters, I could do better still.</p><p>Having said that, the paper does intuitively make sense: in most token-choice routing models, each token is seen by exactly <strong>k</strong> experts, and thereby uses the same amount of compute. This is sub-optimal! We're effectively offloading all compute allocation decisions to the tokenizer, which is often extremely simple (e.g. <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encoding</a>). Methods which can allocate more compute to certain tokens should be significantly better.</p><p>Another advantage of this technique is latency: in token-choice methods, where each token chooses the top-K experts to be routed to, some experts will receive more tokens than others, causing step latency to be bottlenecked by the most loaded expert. To get around this, some implementations (e.g. the <a href="https://arxiv.org/abs/1701.06538">OLNN</a> paper tried this in some of its appendix experiments) force equal allocations, but this is unwieldy and <strong>also</strong> harms latency, as the top-K function isn't a good fit for the programming model used by accelerators (they're good at matmuls, not sorting).
<p>However, this technique requires solving an integer linear program to allocate the tokens, which can be a nightmare in itself; the authors use an approximation so that it can run on TPUs.</p><h2>Soft MoE</h2><p><a href="https://arxiv.org/abs/2306.03745">Abstract</a></p><p>I&#8217;ve saved, perhaps, the best (or at least most novel) paper for last. Unlike most other MoE models, which use the routing layer to discretely route tokens to experts, Soft MoE passes a different weighted combination of all tokens to each expert. The standard sparse MoE transformer has to solve a discrete optimization problem, which is difficult to optimize because it is not differentiable. By making the routing a soft combination, it is immediately differentiable.</p><p>In the Soft MoE algorithm, we have a batch of <em>m</em> tokens, which we refer to as <strong>X</strong>, and <em>n</em> experts, each with <em>p</em> slots. We use <strong>&#934;</strong> to refer to the per-slot routing parameters (a <strong>d</strong> x (<strong>n * p)</strong> matrix). The input slots to the MoE layer, <strong>X&#8217;</strong>, are a weighted combination of <strong>X</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\textbf{D}_{ij} = \\frac{\\exp((\\textbf{X}\\Phi)_{ij})}{\\sum \\limits_{i'=1}^m \\exp((\\textbf{X}\\Phi)_{i'j})} = \\text{softmax}_{\\text{column}}(\\textbf{X}\\Phi)_{ij}, \\quad \\textbf{X}' = \\textbf{D}^T \\textbf{X}&quot;,&quot;id&quot;:&quot;JQOKBPOKEA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The matrix <strong>D</strong> is just the output of softmaxxing over the columns of <strong>X&#934;</strong>. Each slot <em>i</em> is then processed by expert &#8970;i/p&#8971;, applied to the corresponding row of <strong>X&#8217;</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\textbf{Y}_i' = f_{\\lfloor i/p \\rfloor}(\\textbf{X}_i')&quot;,&quot;id&quot;:&quot;XSHUOLCTUM&quot;}" data-component-name="LatexBlockToDOM"></div><p>We then compute the output tokens, <strong>Y</strong>, by combining the slot outputs with a softmax over the rows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\textbf{Y} = \\text{softmax}_{\\text{rows}}(\\textbf{X}\\Phi)\\textbf{Y}'&quot;,&quot;id&quot;:&quot;UPPKVGAXFA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The output of softmaxxing over the rows is called the <em>combine</em> weights, while the output of softmaxxing over the columns is called the dispatch weights.</p><p>This formulation has several nice properties:</p><ol><li><p>It is fully differentiable. There are no nasty discrete functions to kill the gradients.</p></li><li><p>Token dropping isn&#8217;t an issue, like it is with the top-K router from the <a href="https://arxiv.org/abs/1701.06538">OLNN paper</a> (<a href="https://finbarrtimbers.substack.com/p/papers-ive-read-this-week-mixture">discussed in a previous edition</a>), as every slot is filled with a weighted average of all tokens, and the weights are strictly positive thanks to the softmax.</p></li><li><p>It&#8217;s fast, as it avoids sorting/top-k operations, which are slow on accelerators (particularly TPUs).</p></li></ol><p>A major disadvantage is that it doesn&#8217;t currently work with auto-regressive decoders, i.e. all generative LLMs. It shouldn&#8217;t be particularly difficult to extend it, but it will require additional (careful) work.</p>
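<p>To make the algebra concrete, here is a minimal sketch of the dispatch/combine computation in PyTorch. The names and shapes are my own assumptions for illustration, not the authors&#8217; implementation:</p><pre><code>import torch

def soft_moe_layer(x, phi, experts, p):
    """Soft MoE forward pass (illustrative sketch).

    x:       (m, d) input tokens X
    phi:     (d, n * p) per-slot routing parameters
    experts: list of n modules, each mapping (p, d) to (p, d)
    p:       slots per expert
    """
    logits = x @ phi                         # (m, n*p) = X @ Phi
    dispatch = torch.softmax(logits, dim=0)  # D: softmax over the columns
    combine = torch.softmax(logits, dim=1)   # C: softmax over the rows

    slots = dispatch.T @ x                   # X' = D^T X, shape (n*p, d)
    outs = torch.cat(
        [f(slots[i * p:(i + 1) * p]) for i, f in enumerate(experts)], dim=0
    )                                        # Y': each expert sees its p slots
    return combine @ outs                    # Y = C Y', shape (m, d)

# Usage: 16 tokens, d=64, n=4 experts with p=2 slots each.
experts = [torch.nn.Linear(64, 64) for _ in range(4)]
y = soft_moe_layer(torch.randn(16, 64), torch.randn(64, 4 * 2), experts, p=2)</code></pre><p>Note that every operation here is a matmul or a softmax, which is why the method is so friendly to accelerators.</p>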
all generative <a href="http://LLMs.It">LLMs.It</a> shouldn&#8217;t be particularly difficult to extend it, but it will require additional (careful) work.</p><p>They compare their method to the <em><a href="https://arxiv.org/abs/1701.06538">tokens choice</a></em> (where every token selects the top-K experts with the highest routing score) and <em><a href="https://arxiv.org/abs/2202.09368">experts choice</a></em> (where every expert selects the top-C tokens in terms of routing score) routing techniques. The results are quite strong, with soft MoE performing distinctly better in the two image tasks they study:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LUQV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LUQV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 424w, https://substackcdn.com/image/fetch/$s_!LUQV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 848w, https://substackcdn.com/image/fetch/$s_!LUQV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 1272w, https://substackcdn.com/image/fetch/$s_!LUQV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LUQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png" width="1410" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:244004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LUQV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 424w, https://substackcdn.com/image/fetch/$s_!LUQV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff7c008-e67f-4734-86c3-ba998ef7a17d_1410x742.png 848w, 
<p>It also performs better when compared against ViTs with similar FLOPS:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!3rDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png" width="1410" height="664" alt="" loading="lazy"></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:206067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3rDT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png 424w, https://substackcdn.com/image/fetch/$s_!3rDT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png 848w, https://substackcdn.com/image/fetch/$s_!3rDT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png 1272w, https://substackcdn.com/image/fetch/$s_!3rDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2eaa235-aed2-4494-b4de-de9114f15eea_1410x664.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They perform a number of ablations, and find that:</p><ol><li><p>The optimal number of slots per expert is small: 1 or 2.</p></li><li><p>The optimal number of experts is roughly the same number as the amount of input tokens</p></li></ol><p>What I find particularly compelling about this model is that it removes the complexity of sparse models.</p><p>I&#8217;m curious about the potential to make a router that has the flexibility to learn the slots though; something that could interpolate between this and the standard expert/token choice MoE routing models seems 
<h1>Some conclusions</h1><p>My takeaway from reading these papers was:</p><ol><li><p>More papers should use hash layers as a benchmark.</p></li><li><p>It&#8217;s not clear to me that differentiability matters much. It seems intuitively nice to have, but I wouldn&#8217;t really give up anything else to get it.</p></li><li><p>Soft MoE doesn&#8217;t seem completely ready for production&#8212; the optimal number of slots per expert being 1 or 2 seems prohibitively expensive&#8212; but I think that an approach like this is the future, as I suspect that the performance benefits will continue to grow as more researchers explore soft routing techniques.</p></li></ol><p>If I were actively focusing on MoE research, I would be looking into combining these, i.e. a fully differentiable, expert choice, soft routing layer. It seems fairly straightforward to combine them, and I suspect the advantages would stack.</p><p>Linear programming routing, while it performs strongly, adds a lot of complexity to the stack. I&#8217;m not convinced that it&#8217;s worth that complexity, but the performance is strong enough that it&#8217;s worth benchmarking.</p>]]></content:encoded></item><item><title><![CDATA[Papers I’ve read this week, Mixture of Experts edition]]></title><description><![CDATA[I read a bunch of papers about conditional routing models]]></description><link>https://www.artfintel.com/p/papers-ive-read-this-week-mixture</link><guid isPermaLink="false">https://www.artfintel.com/p/papers-ive-read-this-week-mixture</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Fri, 04 Aug 2023 16:12:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!33p1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470d8fcb-21dc-4a37-92db-87eca77f31c1_1274x712.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Papers I&#8217;ve read this week, Mixture of Experts edition</h1><p>Mixture of Experts (MoE) models have been getting a lot of attention lately, what with all the rumours about OpenAI using them in GPT-4. I&#8217;ve been reading a lot of the foundational papers about MoE models, and I&#8217;ve taken detailed notes, which I wanted to share. This is a bit of a long one, so you might want to read this on the web.</p><h2>Background</h2><p>A standard deep learning model uses the same parameters for each input. For instance, when using a ResNet to classify images, the same parameters are used for every image. With conditional routing models, such as MoEs, the parameters each input sees vary depending on the input. MoE models, in particular, select different combinations of parameters for each input. We use the term &#8220;dense&#8221; to refer to plain old vanilla models which do not vary the parameters.</p><p>The name can be somewhat misleading. The MoE technique, at least as it is commonly used, is not an ensembling technique used to combine models trained independently. Rather, it is a clever way to distribute a large transformer over multiple GPUs. It is more analogous to a different type of model parallelism, such as the Megatron architecture. Each &#8220;expert&#8221; can be anything from an entirely separate transformer to a different constituent block of the transformer. Within the network, you have &#8220;router&#8221; layers which are typically much smaller than the rest of the network, and which decide how to allocate tokens across experts. 
The exact way in which this is done varies significantly by method.</p><p>When using conditional routing models, several problems crop up:</p><ol><li><p>Allocating too many tokens to the top experts. When learning to route tokens to experts, a &#8220;winners get bigger&#8221; effect exists, in which a positive feedback loop can cause the most successful expert to see all of the tokens, causing that expert to get better, and thus to have more tokens routed to it. This is problematic as, in the limit, it reduces the model to a standard dense transformer.</p></li><li><p>Poor sharding performance. Even if the tokens are being allocated evenly across experts in expectation, for efficient accelerator utilization we actually want each <em><strong>batch</strong></em> to be evenly allocated across experts. This can be hard to achieve.</p></li><li><p>Evaluating performance vs dense models. It can often be unclear how to compare the models. For instance, several labs trained MoE models with &gt;1T parameters, and these were compared directly to GPT-3, which &#8220;only&#8221; had 175B parameters. The comparison is, of course, invalid, as only a fraction of an MoE&#8217;s parameters are used for each request.</p></li></ol><p>The papers I discuss below attempt to solve these issues.</p><p>One paper I will not be discussing is an <a href="https://arxiv.org/abs/2202.01169">excellent survey paper by Clark et al</a>, which is an overview of how conditional routing models work, and which does a number of strong experiments to help understand their performance. I <a href="https://finbarrtimbers.substack.com/p/papers-ive-read-this-week">wrote about it in an earlier article</a>, and will not repeat myself here.</p><h2>Outrageously large networks</h2><p><a href="https://arxiv.org/pdf/1701.06538.pdf">Abstract</a></p><p>The OLNN paper proposed a Sparsely-Gated Mixture-of-Experts layer, one of the first MoE implementations that was both practical and performant at scale.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!33p1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470d8fcb-21dc-4a37-92db-87eca77f31c1_1274x712.png" width="1274" height="712" alt="" loading="lazy"></figure>
<p>It was by Noam Shazeer, as seemingly all papers advancing the state of the 
art in NLP are, and in it, the authors apply a MoE layer convolutionally between stacked LSTM layers. Their proposed Mixture of Experts layers consist of a set of <em>n</em> expert networks, E_i(x), and a gating network, G, that outputs a sparse, n-dimensional vector. The gating network decides how to combine the experts:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\sum \\limits_{i=1}^n G(x)_i E_i(x)&quot;,&quot;id&quot;:&quot;EEJTZEBGVS&quot;}" data-component-name="LatexBlockToDOM"></div><p>If G(x)_i = 0, we can skip computing E_i(x), which can be expensive (it can be, for instance, a decoder block from a transformer). You can also stack gating networks to create a hierarchy of networks. A common choice for MoE models is for G(x) to be a linear layer with a softmax activation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(x) = \\text{softmax}(\\textbf{x}W^T)&quot;,&quot;id&quot;:&quot;WKKDSPHZEE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the authors add sparsity, and random, learned, Gaussian noise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(x) = \\text{softmax}(\\text{topK}(H(\\textbf{x}), k))&quot;,&quot;id&quot;:&quot;HPIBYTZDOL&quot;}" data-component-name="LatexBlockToDOM"></div><p>topK(x, k) is the identity function for the top k values, and sets the rest to negative infinity, and H(x) is a trainable noise function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(x)_i := (W^T_g\\textbf{x})_i + \\epsilon \\cdot \\text{softplus}(W^T_\\text{noise} \\textbf{x})_i, \\quad \\epsilon \\thicksim N(0, 1)&quot;,&quot;id&quot;:&quot;HXGNDLHFBK&quot;}" data-component-name="LatexBlockToDOM"></div><p>This complicated function allows the gating network to be trained using backprop in the standard way. It also has the nice property that the batch each expert sees scales proportional to k*b*d/n, where we choose k out of n experts, use a batch of size b, and distribute the model over d devices. As such, if we keep the k/n ratio fixed, then we can continue to scale the number of experts (and thus the total model size) as long as we have given Jensen Huang enough money to buy the GPUs we need.</p><p>Interestingly, the paper was written when LSTMs were the SOTA in language modelling, but the proposed architecture is a remarkably strong fit for current architectures, as it makes it easy to shard the experts over a number of GPUs. The paper uses a load balancing loss to try to balance expert utilization. The load balancing loss is an area of active research, and one that subsequent papers focus on. The loss used here is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{\\text{importance}} = c_{\\text{importance}} \\cdot \\text{CV}(\\text{Importance}(X))^2 &quot;,&quot;id&quot;:&quot;VXJCQUXHVS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where the importance is defined as the total weight that all the experts place on the batch <strong>X:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Importance}(X) := \\sum \\limits_{x \\in X} G(x)&quot;,&quot;id&quot;:&quot;XBWCOYWSLW&quot;}" data-component-name="LatexBlockToDOM"></div><p>CV here is the coefficient of variation, i.e. 
the ratio of the standard deviation to the mean:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{CV}(x) := \\frac{ \\sqrt{\\text{Var}(x)}}{ E[x]} \\equiv \\frac \\sigma \\mu&quot;,&quot;id&quot;:&quot;JEXYYQWWDQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This loss encourages equal importance, as the easiest way to minimize it is to minimize the variance of the importances. The authors also use an additional penalty term to further encourage load-balancing:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{\\text{load}} = c_{\\text{load}} \\cdot \\text{CV}(\\text{Load}(X))^2&quot;,&quot;id&quot;:&quot;GDWIJIPFMD&quot;}" data-component-name="LatexBlockToDOM"></div><p>The load is defined similarly to the importance, but they define a new function, P(x, i), as the probability that the i-th entry of G(x) is nonzero when you add noise to it:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Load}(X)_i = \\sum_{x \\in X} P(x, i)&quot;,&quot;id&quot;:&quot;GKENFWOVRE&quot;}" data-component-name="LatexBlockToDOM"></div><p>They have a somewhat complicated way of actually implementing P(x, i), and it&#8217;s worth checking out the appendix for the details. This results in their load balancing loss becoming</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(X) = L_{\\text{load}}(X) + L_{\\text{importance}}(X) &quot;,&quot;id&quot;:&quot;ROCTKNQLXF&quot;}" data-component-name="LatexBlockToDOM"></div><p>The authors <em>also</em> explored yet another loss in which they forced the model to strictly balance tokens across experts. It&#8217;s interesting to see how much time was dedicated to load balancing; it is indicative of how critical balanced routing is to getting the full benefit of MoE models.</p>
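<p>To make this concrete, here is a minimal sketch of the noisy top-K gate and the CV-based importance loss in PyTorch. The variable names and the coefficient value are my own assumptions for illustration; this is not the paper&#8217;s code:</p><pre><code>import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k):
    """G(x): softmax over the top-K noised logits; all other
    experts get exactly zero weight (illustrative sketch).

    x: (batch, d); w_gate, w_noise: (d, n_experts)
    """
    clean = x @ w_gate
    noise_std = F.softplus(x @ w_noise)
    h = clean + torch.randn_like(clean) * noise_std  # H(x)
    topk = torch.topk(h, k, dim=-1)
    masked = torch.full_like(h, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)   # keep top-K, rest stay -inf
    return torch.softmax(masked, dim=-1)             # sparse gate weights

def importance_loss(gates, c_importance=0.01):
    """c * CV(Importance(X))^2, encouraging equal total gate weight."""
    importance = gates.sum(dim=0)  # Importance(X): sum of G(x) over the batch
    cv = importance.std() / importance.mean()
    return c_importance * cv ** 2</code></pre>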
src="https://substackcdn.com/image/fetch/$s_!1wC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png" width="1248" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1wC5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png 424w, https://substackcdn.com/image/fetch/$s_!1wC5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png 848w, https://substackcdn.com/image/fetch/$s_!1wC5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png 1272w, https://substackcdn.com/image/fetch/$s_!1wC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd65593e-6896-45ed-963b-b0318a4f8d38_1248x606.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the era of LLMs, it&#8217;s somewhat funny to look at the number of model parameters here; their largest model has 4B parameters, which would be considered quite small today.</p><h2>Switch Transformers</h2><p><a href="https://arxiv.org/abs/2101.03961">Abstract</a></p><p>Another Shazeer paper, this was inspired by Kaplan et. 
al., which found that GPT models follow power-laws when scaling the model size, dataset size, and computational budget. They add a fourth dimension: the parameter count, keeping the FLOPs per token constant. Like the OLNN paper, by keeping unique experts on different devices, the total parameter count for the model increases proportional to the number of devices, allowing for embarrassingly parallel operations.</p><p>The paper is a minor modification of the OLNN paper. They make two changes:</p><ol><li><p>The authors use k = 1 (which they call a Switch layer), whereas OLNN routing used k &gt; 1. They show this performs better.</p></li><li><p>They use an auxiliary load balancing loss that is the same as in Shazeer (2018) and Lepikhin et al. (2020), and is much simpler than the overly complicated loss(es) used in the OLNN paper:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{load balancing}}(x) := \\alpha N \\sum \\limits_{i=1}^N f_i P_i,&quot;,&quot;id&quot;:&quot;DFLAJIIWCS&quot;}" data-component-name="LatexBlockToDOM"></div></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f_i := \\frac 1 T \\sum \\limits_{x \\in \\mathcal{B}} \\mathbb{1}\\{\\text{argmax } p(x) = i\\}&quot;,&quot;id&quot;:&quot;ANPJXKJMEN&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P_i := \\frac 1 T \\sum \\limits_{x \\in \\mathcal{B}} p_i(x) &quot;,&quot;id&quot;:&quot;VFTVZNGUPG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here f_i is the fraction of tokens in the batch dispatched (via argmax) to expert <em>i</em>, while P_i is the average router probability assigned to expert <em>i</em>, with p_i(x) the probability of allocating token <em>x</em> to expert <em>i</em>. As we want to spread tokens evenly across the N experts, we want both the f_i and the P_i terms to equal 1/N for all <em>i</em>; the loss encourages this through the P-vector, which is the only differentiable part of the expression. The sum is multiplied by N to keep the loss&#8217;s scale independent of the number of experts: under uniform routing, the sum equals N * (1/N^2) = 1/N, so the overall loss is simply &#945;.</p><p>They find that these small changes are enough to perform better than both MoE and dense transformers with equivalent runtime performance.</p>
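<p>A minimal sketch of this auxiliary loss, with my own variable names (not the paper&#8217;s code):</p><pre><code>import torch

def switch_load_balancing_loss(router_probs, expert_index, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, per the formula above.

    router_probs: (T, N) softmax router probabilities p(x) for each token
    expert_index: (T,) index of the argmax expert chosen for each token
    """
    T, N = router_probs.shape
    # f_i: fraction of tokens dispatched to each expert (not differentiable)
    f = torch.bincount(expert_index, minlength=N).float() / T
    # P_i: mean router probability per expert (differentiable)
    P = router_probs.mean(dim=0)
    return alpha * N * (f * P).sum()</code></pre>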
<p>The paper also asks a Chinchilla-esque question:</p><blockquote><p>For a fixed training duration and computational budget, should one train a dense or a sparse model?</p></blockquote><figure><img src="https://substackcdn.com/image/fetch/$s_!GwFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fd5e417-4708-45b2-a75c-94a70172a9b3_764x578.png" width="764" height="578" alt="" loading="lazy"></figure>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They find a significant advantage to training a sparse transformer vs a dense one, given a fixed computation budget. This matches my intuition; with much more parameters, and an efficient way of selecting between them, the model should have much higher performance.</p><p>As an applied practitioner, I think looking at inference cost is the better benchmark to use, which is even more in favour of sparse transformers. Once a MoE model has been trained with some number N &gt; K experts of size P, where each inference involves routing tokens to the top K experts, it has the same performance characteristics as a dense model with KP parameters. As typically the inference costs will dramatically outweigh the training costs, this becomes the relevant benchmark.</p><p>Put differently, if we&#8217;re deploying thousands of GPUs to serve inference requests, we don&#8217;t have preferences over deploying them to serve a larger number of <em>copies</em> of the same dense transformer, or using them to serve a smaller number of MoE transformers: we&#8217;re computationally agnostic (assuming that we&#8217;re able to achieve uniform routing).</p><p>Two papers that the authors reference repeatedly throughout the Switch Transformers paper are the <a href="https://arxiv.org/abs/2006.16668">GShard</a> and <a href="https://arxiv.org/abs/1811.02084">Mesh-Tensorflow</a> papers, which go into depth about the computational details of implementing these models. They&#8217;re worth a read if that&#8217;s of interest to you.</p><h2>ST-MoE</h2><p><a href="https://arxiv.org/abs/2202.08906">Abstract</a></p><p>I&#8217;m apparently a Shazeer fanboy today- a third paper<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> of his in this line was the Stable and Transferable Sparse Expert Model paper, which built on the Switch Transformer paper to introduce the ST-MoE transformer. The paper focuses on the training instabilities that come up in training sparse expert models.</p><p>The paper focuses on identifying approaches that can improve training stability for sparse models. 
Training instability here means training runs which diverge:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!eeEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57391eda-98c3-4ca4-8e01-91ce51778dac_1160x576.png" width="1160" height="576" alt="" loading="lazy"></figure>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The authors explore a bunch of methods to stability models, but note that many of them degrade model quality, which is undesirable. They also note a number of changes which worsen stability, but increase quality (e.g. modifications with more multiplicative components, like GEGLU or RMSNorm). They propose a new loss, the router <em>z</em>-loss, which doesn&#8217;t degrade quality:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_z(x) := \\frac 1 B \\sum \\limits_{i=1}^B \\left( \\log \\sum \\limits_{j=i}^N e^{x_j^{(i)}}\\right)^2&quot;,&quot;id&quot;:&quot;XFTSGGATUM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>B</em> is the number of tokens, <em>N</em> is the number of experts, and <em>x</em> are the logits going into the router. This is a penalty on large logits into the gating network. As such, the overall loss for the model becomes (yet another complicated, multi-part loss function!):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{\\text{total}} = L_{\\text{cross-entropy}} + c_{\\text{load-balancing}} L_{\\text{load-balancing}} + c_zL_z&quot;,&quot;id&quot;:&quot;NOQLGQPGLT&quot;}" data-component-name="LatexBlockToDOM"></div><p>The idea is that modern transformers are trained with mixed precision, typically using bfloat16 for all matmuls other than gradient updates, which are stored in float32. 
As such, larger logits involve larger roundoff errors (table from the ST-MoE paper):</p><figure><img src="https://substackcdn.com/image/fetch/$s_!oLep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F524ed6c4-aaea-4a3b-bc64-c5462b2985a0_464x270.png" width="464" height="270" alt="" loading="lazy"></figure>
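<p>A minimal sketch of the router <em>z</em>-loss (my variable names, not the paper&#8217;s implementation):</p><pre><code>import torch

def router_z_loss(router_logits, c_z=0.001):
    """Mean of (logsumexp over experts)^2: penalizes large router logits.

    router_logits: (B, N) pre-softmax logits going into the router.
    """
    return c_z * torch.logsumexp(router_logits, dim=-1).square().mean()</code></pre>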
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The routing layers in MoE models introduce additional exponential functions through the softmax function, and exponential functions, as they increase the scale of the values being considered, can lead to higher roundoff errors. This is particularly problematic when routing with k &gt; 1 experts.</p><p>The authors run a number of experiments comparing top-K routing, and find that top-2 tends to be the sweet spot in terms of complexity and compute. While you can increase quality by increasing the size of K, it comes at a linear increase in compute cost.</p><p>The authors also perform a qualitative analysis of what the experts learn; they find that various experts specialize in punctuation, visual descriptions, proper names, and counting/numbers. This is interesting as that isn&#8217;t forced by any of the specific losses, but rather emerges from the model, as one might naively suspect. However, they don&#8217;t observe that experts specialize in languages when trained on multilingual datasets, which I found surprising. This sort of qualitative analysis is excellent and often missing. I wish more papers had this!</p><p>With this paper, we start to see the load-balancing losses simplifying and converging on more interpretable, manageable, functional forms, when compared to the more complicated setups from the earlier papers, such as OLNN.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Interestingly, the last author on this paper, William Fedus, is currently at OpenAI. He joined Jack Rae and Aidan Clark there, two of DeepMind&#8217;s authors from the &#8220;Scaling laws for routed language models&#8221; paper (although Jack Rae recently returned to DeepMind). Similarly, all of the authors on the Switch Transformers paper (William Fedus, Barret Zoph, and Noam Shazeer) have left Google, with Zoph and Fedus going to OpenAI. OpenAI seems to have accumulated quite a stable of sparse transformer experts! (I&#8217;m curious how they route work between themselves. 
&#129345;&#128165;)</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The market for AI companies]]></title><description><![CDATA[After being laid low by a sick child turning into a sick family, I&#8217;ve got a bunch of articles in the queue, and I hope to have another one up by the end of June/early July, probably about conditionally routed language models.]]></description><link>https://www.artfintel.com/p/the-market-for-ai-companies</link><guid isPermaLink="false">https://www.artfintel.com/p/the-market-for-ai-companies</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Sun, 18 Jun 2023 18:44:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JwWp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c3a757-b2e7-4104-9f17-1e79c01d013c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After being laid low by a sick child turning into a sick family, I&#8217;ve got a bunch of articles in the queue, and I hope to have another one up by the end of June/early July, probably about conditionally routed language models. </p><p>The article that follows is a slight departure from my usual, more technical subject matter. Please let me know what you thought of it. </p><h1>The market for AI companies</h1><p>Recently, I&#8217;ve been talking to founder friends of mine who have been raising money, and to some VC friends of mine who are <em>investing</em> money in AI companies. I want to share some of the topics that we&#8217;ve discussed, give my perspective on what I would want to invest in if I were deploying capital/what I would be looking for if I were looking for a job, and try to generally make sense of a bizarre market.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>In these discussions, a few themes consistently come up:</p><ol><li><p>The consensus is that AI is going to be (mostly) a &#8220;sustaining&#8221; innovation, by which I mean that most of the financial benefits will make the bigger companies stronger.</p></li><li><p>&#8220;Mostly&#8221; is key. There will be a few companies that do break through, and these will tend to dominate their market.</p></li><li><p>Many of the large AI investments won&#8217;t pan out.</p></li></ol><p>Points 1 &amp; 2 are really the same point: the winners tend to accrue all the advantages in AI. This is because, generally speaking, AI systems tend to get better over time. While there will be companies that develop new markets&#8212; we need only look at Midjourney/Dalle, or ChatGPT/Claude to see this&#8212; the majority of the applications of AI will be to make existing companies/products better. Consider image editing. While it is possible, of course, that a new image editing system is developed that replaces Photoshop, it seems far more likely that Adobe will acquire a generative AI company and/or continue to develop Firefly until it is the market leader.</p><p>As another example, consider one of the oldest AI systems in production: Google search. Because it was making money and had a ton of users, Google was able to start cranking out optimizations at every level of the stack: they had custom hardware, they were globally distributed, they were able to keep hiring employees who could make it better in every way, and they had a sales team selling ads which was financing everything. 
If you came up with a brilliant new search algorithm, you&#8217;d have to compete with the Google search algorithm, the Google ad sales team, the Google datacenter team, and the Google Chrome team. All of these were working together to make their product better. In addition to all of this, Google was <strong>still</strong> working on their algorithm, and had more data than anyone else to train it on.</p><p>This is generally true for AI systems. Yet another example, this time of new products that have developed a strong market position, comes from OpenAI and Anthropic. OpenAI has, roughly, <a href="https://en.wikipedia.org/wiki/OpenAI">400 employees</a>, while Anthropic has around 100. OpenAI raised ~$10B from Microsoft, while Anthropic raised $450M. OpenAI has dominated the news, so non-technical people (like, say, my chemical engineer Dad, who&#8217;s an avid user of ChatGPT) use their products, while only geeks like myself, and you, dear reader, use Claude. As a result, if OpenAI doesn&#8217;t make any massive mistakes, they&#8217;re going to be able to scale to more users and improve more quickly. So even if Anthropic is able to make an LLM that is just as good, I struggle to see how they&#8217;ll be able to steal significant market share from OpenAI unless they come up with a fundamental breakthrough.</p><p>A contributing factor to this is the scarcity of GPUs. Just being able to serve requests to customers is a competitive advantage right now. OpenAI&#8217;s model latency is reportedly ~2-4x <a href="https://twitter.com/helicone_ai/status/1662325356563496961?s=19">slower than it was in April</a>, likely because they&#8217;re hitting the limits of their cloud resources. This creates a positive feedback loop: the companies with the most money have the most GPUs, so they can serve the most requests, each of which they make a profit on, giving them more money, which they use to buy more GPUs.</p><p>Another point is that machine learning relies on data, and if you&#8217;re interacting with consumers, you&#8217;re getting a large amount of data that is <em>specifically tailored</em> to your application. That advantage keeps getting more valuable: you can hire more researchers to develop more effective techniques, which compounds with the larger amount of data. This leads to an accumulating advantage.</p><p>Yet another advantage is that, as you serve your product to users, you start to accumulate research for your <em>specific problem</em>. With each additional improvement you layer onto your product, it becomes harder for newcomers to compete. Consider, for instance, Google Search, as discussed above, or ChatGPT, which seems like a counter-example given how many LLM chatbots have been launched to compete with it. But they&#8217;re all notably worse than GPT-4, and they struggle with the same problems. OpenAI has been able to iterate for 6 months and has been introducing various add-ons to improve ChatGPT. The only LLM that comes close to GPT-4 performance is Claude, and even then, there&#8217;s a gap. This is despite an absolutely outrageous amount of capital being deployed and most of the research being open source.</p><div class="paywall-jump" data-component-name="PaywallToDOM"></div><p>The bar to create a competitor to ChatGPT (which was released in November 2022) was relatively low, and there are a number of products that are comparable to the original version. But the <em>current</em> version has improved dramatically, making it tough for competitors to keep up. 
It&#8217;s particularly hard because ChatGPT (and Claude) have had access to a lot of user data, which I strongly suspect they&#8217;re able to use to make their models better, through RLHF. This creates a path-dependency issue: to compete with them, one needs a large number of iterations, which is only possible after your model has been running for a while. The only room for competition, from what I can see, is in going after niches, particularly niches that are unsavory (sexually explicit content, politically biased content, etc.) that the big companies won&#8217;t go after.</p><p>This is a point worth expanding on; there is a substantial market opportunity in going after the Generative AI market in the niches which the current AI safety crowd has deemed unethical/unsavory. If we look at, for instance, <a href="https://civitai.com/">Civitai</a>, their front page is almost all pictures of pretty light-skinned women, or if we search <a href="https://twitter.com/search?q=%23stablediffusion&amp;src=typed_query">#stablediffusion</a> on Twitter, we see many pictures of scantily clad women, while searching <a href="https://twitter.com/search?q=%23dalle&amp;src=typed_query&amp;f=top">#dalle</a> returns much more surreal, sci-fi imagery. Another example, of course, is <a href="https://gizmodo.com/blush-ai-chatbot-replika-online-dating-dating-apps-1850514242">Replika</a>, which allows one to chat with a &#8220;flirting companion.&#8221; ChatGPT/Claude very much do not provide any experience like this. There&#8217;s nothing wrong with this, of course, as these are all perfectly valid businesses for one to get into, but tech as an industry is famously prudish, with many platforms banning <a href="https://qz.com/2050359/onlyfans-blames-financial-companies-for-pivot-from-porn">any form of explicit content</a>. </p><p>This brings me to another point: the current AI funding scene doesn&#8217;t make sense. There are, to the best of my knowledge, seven (7!) foundation model companies:</p><ol><li><p>OpenAI</p></li><li><p>Anthropic</p></li><li><p>Google/DeepMind</p></li><li><p><a href="http://Stability.ai">Stability.ai</a></p></li><li><p>Inflection</p></li><li><p>Cohere</p></li><li><p><a href="https://techcrunch.com/2023/06/13/frances-mistral-ai-blows-in-with-a-113m-seed-round-at-a-260m-valuation-to-take-on-openai/">Mistral</a></p></li></ol><p>This is not to mention a bunch of other startups that are trying to compete in this space as well. </p><p>To be blunt, I don&#8217;t see how these have businesses which justify their multibillion dollar valuations. The simple way to value a (mature, public) company is to assign it a multiple based on the profit it makes. Companies on the NASDAQ trade at roughly <a href="https://www.infrontanalytics.com/fe-EN/35812NU/Google-Inc-/market-valuation">20x profits</a>. So to be worth $4B, which is what rumours say Anthropic recently raised at, they need to make $200M in profits. For investors to get a 10x return on that investment, which is typically what VCs are looking to make on a successful investment, Anthropic would need to make $2B in profits. The same applies to all of these businesses.</p><p>The problem is, though, that I don&#8217;t see any reason to buy the second best LLM, let alone the <em>7th best</em>. I think every major cloud provider will have a foundation model API. After that? Who&#8217;s going to buy these? Maybe they&#8217;ll develop independent businesses, but I struggle to see it. I think that their best bet will be to be acquired. 
But at the valuations they&#8217;re currently at, it&#8217;s tough to imagine investors seeing venture-scale returns.</p><p>Consider the recent round that Mistral raised: <a href="https://techcrunch.com/2023/06/13/frances-mistral-ai-blows-in-with-a-113m-seed-round-at-a-260m-valuation-to-take-on-openai/">$113M at a $260M valuation</a>. VCs typically expect a ~10x return on their money, which here would require a $2.6B exit. There aren&#8217;t many of those!</p><p>So if many investments aren&#8217;t going to pan out, why are they being made? Two reasons:</p><ol><li><p>FOMO.</p></li><li><p>Logo hunting.</p></li></ol><p>As a VC, you have two ways to make money. Most VCs charge 2+20: 2% per year of the assets under management (AUM), and 20% of the returns from their investments. They can either keep expanding AUM and get 2% of that, even if their investments don&#8217;t pan out, or they can get money from successful investments. It&#8217;s hard to pick investments (and it takes a long time to <a href="https://twitter.com/tylertringas/status/1302989759997005825?s=46&amp;t=_LCsoamG7K4pQj0vxlk0XA">actually see returns</a>)! But if you can consistently prove to your limited partners (LPs, the people who invest in VC funds) that you can get in on the hottest deals, they&#8217;re going to give you more money.</p><p>So many VCs are investing in hot companies to show they <em>can</em> invest in hot companies, even if the valuations don&#8217;t necessarily make sense. This is not a criticism of VCs; they are making rational decisions which will, almost certainly, make them money. But it is not the case that every one of these AI unicorns will make money. I think the bets that are being made are reasonable for VCs, who are able to take a high-risk, high-reward approach to their portfolio, but if you&#8217;re an individual deciding whether to join one of these companies as an employee, you should view foundation model companies as very high-risk/high-reward bets and value your equity accordingly.</p><h3>A market for lemons</h3><p>AI companies right now are incredibly odd by the standards of most companies, because many of them are able to make money from a very early point. They&#8217;re developing large consumer subscription businesses, which are, basically, the best businesses to have (other than <a href="https://stratechery.com/2022/digital-advertising-in-2022/">search advertising</a>). Money keeps rolling in every month. This is allowing these companies to self-fund, reducing their dependency on VCs.</p><p>As such, I would be hesitant to join a company that didn&#8217;t have revenue right now, unless I was joining as part of the founding team. AI is such a dynamic field that it&#8217;s hard to justify making large R&amp;D investments that might not pan out for 18+ months (let alone several years). That&#8217;s tough for investors. This has resulted in a strange bifurcation, with two broad categories of AI companies:</p><ol><li><p>Companies which require a massive amount of capital for large, long-term bets, which they use to hire expensive ex-DeepMind researchers (<em>wink</em>) and to give directly to Jensen Huang.</p></li><li><p>Companies that require two people, a garage, and a few A100s to make a product that can be directly sold to consumers.</p></li></ol><p>This gap is brutal for investors. #1 has a long-term payoff, which could be over 10 years away, and there are relatively few examples of this making money in AI. 
Sure, there&#8217;s OpenAI, but they&#8217;re a bit of a fluke. What you really want is to invest in #2, but if you&#8217;re them&#8230; why take investment? Just build the project and try to sell it! If you can start making money, then raise, but you raise on much better terms, with much less dependence on investors.</p><p>This means that, basically, there&#8217;s a trap when it comes to investing/raising funding at the pre-seed/seed stage. Companies that fall into category 2 don&#8217;t need it, so you end up funding category 1 companies, which are much riskier bets. This levels out at Series A, as by then the category 2 companies become similar to standard consumer SaaS companies, which can take investment as usual to build out sales, finance, support, etc., but it means that there&#8217;s a chasm of missing seed-stage investment opportunities.</p><p>Either you make money, in which case you can bootstrap, or you don&#8217;t, in which case you can try again. What investors want, generally speaking, is a machine where you can put money in and <em>predictably make the company more valuable</em>. The poster child is Uber, which, at its peak, was able to use additional investor dollars to expand into new markets. They had the option (which they have now exercised) to stop spending on expansion and start focusing on profitability.</p><p>With AI companies, however, it seems like companies are able to find (or not find) product-market fit very, very quickly. Consider:</p><ol><li><p>Lensa</p></li><li><p>Midjourney</p></li><li><p><a href="http://Stability.ai">Stability.ai</a></p></li><li><p><a href="http://Character.ai">Character.ai</a></p></li><li><p>Replika</p></li></ol><p>All of these products were able to find product-market fit very quickly. Now, to a certain extent, this is because the underlying technology has been built out by research funded by the big companies, particularly Google, but the point remains.</p><p>Of course, these companies are all investable, as they can now take investor money and turn it into a bigger team, more sales people, more researchers, more support staff, etc., but they can skip straight to the Series A without having to go through pre-seed/seed as it is so cheap to build these products. 
</p><p>In short, there&#8217;s a lot of interest from investors at pre-seed/seed, but I think many, many companies will find that, when they go to raise their A, they&#8217;re not the darlings they once were.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In a startup, these are basically the same thing, as by taking equity in the company in exchange for salary, you are directly investing in the future success of the company.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Efficient LLM inference]]></title><description><![CDATA[On quantization, distillation, and efficiency]]></description><link>https://www.artfintel.com/p/efficient-llm-inference</link><guid isPermaLink="false">https://www.artfintel.com/p/efficient-llm-inference</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Tue, 09 May 2023 16:22:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Py9m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c4ff675-3791-4b6b-951a-bb59111a2f91_1860x736.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Lately, I&#8217;ve been thinking a lot about inference, and particularly, how to serve a given LLM more efficiently. The scenario is as follows: Your boss comes to you and says <em>Hey Finbarr, we&#8217;re about to go bankrupt because we&#8217;re spending all of our investors&#8217; money on GPUs serving our 300B parameter model that raps in the style of <a href="https://en.wikipedia.org/wiki/John_Kenneth_Galbraith">John Kenneth Galbraith</a>. What can we do?</em></p><p>Broadly speaking, there are three main classes of things you can do:</p><ol><li><p>You can quantize the parameters of your model (<em>quantization</em>), where you keep your model exactly the same, but use less precision for each of the parameters.</p></li><li><p>You can distill a smaller version of your model (<em>distillation</em>), where you copy the architecture of your model to make it smaller and/or more efficient and then train this new, smaller model to mimic the outputs of the original, large model.</p></li><li><p>You can spend a bunch of time profiling your code and reduce the overhead without changing the architecture or parameters (<em>optimization</em>).</p></li></ol><p>The first place to start, somewhat obviously, is optimization. The amount of overhead that most programs have is ridiculous, and by simply profiling<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> your code, you can often find surprising amounts of overhead. For instance, I once had a colleague ask for help optimizing his code. He was training a neural network to perform a sophisticated calculation and had implemented a bunch of performance optimizations to make that faster, but he was <em>also</em> using a list to do lookups in a performance-critical loop. 
I changed the list to a dict and made the code 200x faster.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p><p>This isn&#8217;t a rare occurrence. Every time I&#8217;ve profiled code, I&#8217;ve been surprised at the resulting profile. So if you&#8217;re having performance issues (don&#8217;t worry, it happens to all of us, it&#8217;s natural), the first thing you should do is profile your code.</p><p>For most people, this will be sufficient. Just removing the overhead from your code and batching requests in the naive way will get you to the point where you can serve requests in a cost-effective manner, <em>particularly</em> if you have traditional software margins. But let&#8217;s say that you&#8217;ve done a bunch of profiling and you&#8217;re now at the point where the only remaining optimizations are implementing arcane kernels in Triton, which would require hiring grizzled old CUDA experts away from Nvidia. What&#8217;s the next step you can take to use your GPUs effectively?</p><p>Now, you&#8217;re left with quantization and distillation. Quantization, where you use less precise weights for your neural network without changing anything else about it, has been talked about a lot lately. Llama.cpp, for instance, used this to great effect to reduce the memory required to store the LLaMA weights by 4x.</p><p>Distillation has received less attention but has historically been an important part of serving models at scale. This is because distillation generally works much, much better than quantization, and, if you have the resources, it should be the way you do things.</p><p>There&#8217;s a key caveat there: if you have the resources.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?group=true&amp;coupon=9ee370da&quot;,&quot;text&quot;:&quot;Get 20% off a group subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?group=true&amp;coupon=9ee370da"><span>Get 20% off a group subscription</span></a></p><p>Let&#8217;s go back to our hypothetical scenario. You&#8217;re a hardworking ML engineer at CoherentOpenStability, where you&#8217;re trying to reduce the inference costs for your latest and greatest LLM, StableClaudius-4. You&#8217;ve already profiled your code and reduced all of the overhead that you can. You now have a few options:</p><ol><li><p>You come up with a research breakthrough which lets you accomplish the same thing, for cheaper. E.g. you design a new sparse attention mechanism which works well.</p></li><li><p>You make your model smaller.</p></li></ol><p>If I were to compare these, the obvious winner is #1. If you can come up with a novel research contribution that magically improves your model, you should obviously do that. If this is you, stop reading this article, go write a paper, apply to OpenAI/Anthropic/DeepMind, and collect a ridiculously high salary for being a large language model whisperer. Most of us cannot do this. So we&#8217;re stuck trying to come up with a smaller model that accomplishes the same things.</p><p>How should we come up with a smaller model? 
A few options:</p><ol><li><p>You train a smaller model in the exact same way as your original model.</p></li><li><p>You distill your big model into a smaller model.</p></li><li><p>You quantize your existing model.</p></li></ol><p>In my opinion, the literature indicates a clear &amp; obvious ranking: distillation is strictly better than training a smaller model, and quantizing is <em>probably</em> better than training a smaller model.</p><p>There aren&#8217;t as many distillation papers as I would like, but the two that come to mind are <a href="https://arxiv.org/abs/1910.01108">DistilBERT</a> and the <a href="https://arxiv.org/abs/1503.02531">original distillation paper</a> from Hinton et al. In DistilBERT, the authors reduce the model size by 40% while only hurting performance by 3%.</p><p>In the Hinton et al. paper, they&#8217;re able to match the performance of an ensemble of 10 models with a single, distilled model, and performance only decreases from 61.1% accuracy to 60.8% accuracy (99.5% of the original performance, with 10% of the size). Now, the Hinton paper is comparing against an ensemble, which is a particularly wasteful way to increase model size, but that&#8217;s still an impressive result, and much better than training a model from scratch to perform the same task (which had only 58.9% accuracy).</p><p>The problem with distillation, however, is that it requires training a smaller model from scratch <em><strong>and</strong></em> running inference over your entire dataset with your large model. If you have a dataset the size of GPT-3&#8217;s (500B tokens), this would cost $1M at public API prices (5e11 tokens * 2e-6 $/token = $1e6), or $400k if we assume OpenAI has a 60% margin. Given that it cost approximately $5M to train GPT-3 initially, this would add 10-20% to that already large cost. Not prohibitive, but expensive.</p><p>If you can afford this cost, great! Do it. It&#8217;s almost certainly going to give you the best performance. If you want something cheaper, you&#8217;re deciding between training a smaller model from scratch and quantizing an existing model. To help, we have the paper <a href="https://arxiv.org/abs/2212.09720">k-bit inference scaling laws</a>. The idea is that, from an inference perspective, we&#8217;re agnostic between serving a 30B model at one level of precision and serving a 60B model at half that precision, as most GPUs are twice as fast at running models with half the precision (e.g. <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf">A100s</a>).</p><p><em>[Figure (original image missing): mean zero-shot accuracy as a function of total model bits, at various bit precisions.]</em></p><p>This figure shows the tradeoff between using various model sizes with various levels of precision. Let&#8217;s compare two points for the <a href="https://arxiv.org/abs/2205.01068">OPT</a> line of work.</p><table><thead><tr><th>Total model bits</th><th>Bit precision</th><th>Mean zeroshot accuracy</th></tr></thead><tbody><tr><td>10^11</td><td>8</td><td>0.675</td></tr><tr><td>10^11</td><td>16</td><td>0.65</td></tr><tr><td>10^12</td><td>8</td><td>0.725</td></tr><tr><td>10^12</td><td>16</td><td>0.7</td></tr></tbody></table><p>What we see is that, given a total number of model bits, we prefer the model with <em>fewer</em> bits per parameter.</p>
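<p>To make the &#8220;total model bits&#8221; framing concrete, here&#8217;s a tiny sketch of the arithmetic (mine, not from the paper):</p><pre><code># At a fixed bit budget, lower precision buys you more parameters.
BIT_BUDGET = 1e11  # total model bits, as in the first two rows of the table

for bits_per_param in (16, 8, 4):
    params = BIT_BUDGET / bits_per_param
    print(f"{bits_per_param}-bit: {params / 1e9:.2f}B parameters")

# 16-bit: 6.25B parameters
# 8-bit: 12.50B parameters
# 4-bit: 25.00B parameters
</code></pre><p>The first two rows of the table say that the 12.5B-parameter model at 8 bits beats the 6.25B-parameter model at 16 bits, even though both use the same total number of bits (and so cost roughly the same to serve).</p>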
<p>Intuitively, this makes sense: we don&#8217;t see a benefit from training half as many parameters with fp64 vs fp32.</p><p>If we look at another figure, this time from the <a href="https://arxiv.org/abs/2205.01068">OPT paper</a>, we can analyze how performance scales with the number of parameters. As OPT uses FP16, which uses 2 bytes (or 16 bits) per parameter, 1e11 parameters is equal to 1.6e12 bits. By using 10x fewer parameters, going from 1.6e12 to 1.6e11 bits, the average accuracy for OPT goes from 0.7 to 0.65: a 10x decrease in cost for a ~7% decrease in accuracy. Not quite as good as the model size/accuracy tradeoffs we see with distillation, but I think that most businesses would have to strongly consider the tradeoff.</p><p>The other thing to keep in mind about quantization is that it&#8217;s remarkably cheap to do! The SOTA method for quantization is <a href="https://arxiv.org/abs/2210.17323">GPTQ</a>, which can quantize a 175B parameter model in 4 GPU-hours (roughly $4 of cost at public cloud prices). Training a model from scratch, on the other hand, costs a lot: a rough estimate of the cost to train a GPT-3 style model is $5M for the full model, with cost scaling linearly in the number of parameters, so a 20B model would cost ~$500k, and it requires a lot of data (~400B tokens to be <a href="https://arxiv.org/abs/2203.15556">Chinchilla optimal</a>).</p><p>So quantizing is great. But what, exactly, <em>is</em> quantization, and how does it work?</p><p>The idea behind quantization is simple. Computers, due to their discrete nature, can&#8217;t natively store real numbers; digital numerical representations are based on <strong>bits</strong>, namely 1s or 0s. These bits are assembled into <strong>binary</strong> representations. In a binary integer representation, you can represent a range of </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;[-2^{n-1}, 2^{n-1} - 1]&quot;,&quot;id&quot;:&quot;SGHWKNHCNC&quot;}" data-component-name="LatexBlockToDOM"></div><p>using a signed (two&#8217;s-complement) integer, where n is the number of bits. One bit effectively indicates whether the number is positive or negative, and the remaining n - 1 bits represent the magnitude.</p><p>This works well, and is reasonably efficient. However, the problem comes when you want to represent the <a href="https://en.wikipedia.org/wiki/Real_number">real numbers</a>, i.e. numbers that can take on values between integers. 
The most common approach is to reserve 1 bit to indicate the positivity/negativity of the number (the sign), m bits to represent the magnitude of the number (the <em><strong>exponent</strong></em>), and (n - m - 1) bits to represent the precision of the number (the significand).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Py9m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c4ff675-3791-4b6b-951a-bb59111a2f91_1860x736.png" alt="The sign, exponent, and significand fields of a floating point number" width="1456" height="576"></figure></div>
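<p>As a quick illustration of these fields (my own sketch, nothing from a paper), you can pull the sign, exponent, and significand out of a float32 directly in Python:</p><pre><code>import struct

def float32_fields(x):
    # Pack x as an IEEE-754 single-precision float and read back the raw bits.
    bits = int.from_bytes(struct.pack("!f", x), "big")
    sign = bits // 2**31               # 1 sign bit
    exponent = (bits // 2**23) % 256   # 8 exponent bits, stored with a bias of 127
    significand = bits % 2**23         # 23 significand bits
    return sign, exponent, significand

# 0.534345 is roughly 1.069 * 2**-1, so the biased exponent is 127 - 1 = 126.
print(float32_fields(0.534345))
</code></pre>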
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The significand is just a (n-m-1)-bit unsigned integer, and can thus represent values up to 2^{n - m - 1}. </p><p>In a 32-bit floating point number (single precision), 1 bit is used for the sign, 8 bits for the exponent, and 23 bits for the significand. </p><p>In a 16-bit floating point number (half precision), 1 bit is used for the sign, 5 bits for the exponent, and 10 bits for the significand. </p><p>In a 64-bit floating point number (double precision), 1 bit is used for the sign, 11 bits for the exponent, and 52 bits for the significand.</p><p>Note where the additional bits are going&#8212; they are mostly going to the significand, which adds <em>precision</em>, rather than <em>magnitude</em>. In other words, this lets us distinguish between <em>smaller</em> numbers, rather than allowing us to represent <em>bigger</em> ones.</p><p>By default, all major tensor programming frameworks use 32-bit precision to store trainable parameters. There&#8217;s a reason for this: 32-bit precision tends to be a good default. There are very few applications which benefit from the additional precision (mostly scientific computing applications). However, in practice, most of the bleeding edge work now uses 16-bits.</p><p>But ok. Now that you&#8217;ve read through my digression on how precision works in floating point numbers, let&#8217;s say we&#8217;ve chosen a level of precision. How do you <em>actually</em> lower the precision of your weights? The naive approach is to simply truncate your weights at a given level of precision. As a simple example, if your weight is 0.534345, naive truncating the weights will convert it to 0.534.</p><p>The <a href="https://twitter.com/Tim_Dettmers/status/1642885684057997313?s=20">SOTA model for quantizing to 4-bits or below</a> is GPTQ. Some other methods are LLM.int8() and ZeroQuant. I&#8217;ll discuss these in depth in a future article, but here, I&#8217;ll focus on GPTQ. 
<p>The basic idea behind GPTQ is that, while there&#8217;s necessarily a drop in the information contained within the network when we reduce the number of bits, we can reduce the impact this has on inference accuracy by choosing the quantized weights to directly minimize the difference between the original layer outputs and the quantized ones:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!fbdi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2685118e-9f15-4033-a75c-8899305bc637_1622x582.png" alt="The GPTQ layer-wise reconstruction objective" width="1456" height="522"></figure></div>
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s walk through an example. Let&#8217;s say that x = 0.323, and as above, w = 0.534345. Then, keeping everything as a float32, the activation output is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x \\cdot w = 0.323 \\cdot 0.534345 = 0.172593435&quot;,&quot;id&quot;:&quot;XGAOVNOXYR&quot;}" data-component-name="LatexBlockToDOM"></div><p>which, rounded to 6 decimal points (the precision for float32s), gives us an output of 0.172593.</p><p>Rounding naively, our output is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x \\cdot w_{\\text{naive}} = 0.323 \\cdot 0.534 = 0.172482&quot;,&quot;id&quot;:&quot;HBKRXGRSED&quot;}" data-component-name="LatexBlockToDOM"></div><p>The difference here is 1.114e-4. If we use GPTQ, we solve</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{argmin}_{\\hat{w}} (0.172593435 - 0.323 \\cdot \\hat{w})^2&quot;,&quot;id&quot;:&quot;EBLJEKZOOF&quot;}" data-component-name="LatexBlockToDOM"></div><p> which gives us </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{w} = \\frac{0.172593435}{0.323} = 0.534345&quot;,&quot;id&quot;:&quot;PHLYFRRBGS&quot;}" data-component-name="LatexBlockToDOM"></div><p> which, rounded to 3 decimal points (the precision for float16s)&#8230; gives us precisely the same answer as naively rounding, but with more effort.</p><p>Presumably this would matter more in other scenarios? I haven&#8217;t been able to come up with a simple example that makes GPTQ worth it. 
<p>But in actual deployment scenarios, GPTQ claimed a significant difference (RTN meaning &#8220;round to nearest&#8221;):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6kRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febe32662-7241-4ba4-9ced-eea6c6a54787_1096x314.png" alt="GPTQ vs. RTN accuracy results from the GPTQ paper" width="1096" height="314"></figure></div>
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So this is a method that works much better than naively rounding, and is cheap.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p><h1>Conclusion</h1><p>Quantization isn&#8217;t magic. Ultimately, you&#8217;re always sacrificing accuracy for performance. Maybe you won&#8217;t lose a lot. But you&#8217;ll never <em>gain</em> accuracy, so at best you&#8217;re staying the same.</p><p>It&#8217;s also unclear how often the tradeoff is worth it. <a href="https://arxiv.org/abs/2212.09720">Tim Dettmers scaling law for quantization</a>. If you&#8217;re using half the precision, it might be worth using <em>the same precision</em> and half the weights and <a href="https://finbarr.ca/llms-not-trained-enough/">training for twice as long on more data</a>. This is what, for instance, <a href="https://blog.replit.com/replit-developer-day-recap#newmodel">replit did</a>. For many practitioners, the cost to <em>serve</em> a model heavily outweighs the cost to <em><strong>train</strong></em> the model. If this is you, you might not care about quantizing one.</p><p>Even if you do, distillation will typically outperform quantization. So if you <em>can</em> distill the model, you probably should. It&#8217;s only when you don&#8217;t have the resources to do this that quantization is clearly worth it.</p><p>Finally, with quantization, you only get a linear speedup as you decrease the number of bytes. That&#8217;s pretty good! But ideally we&#8217;d want to see much better scaling. 
Perhaps some sort of sparsity will do better.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Not even using a fancy GPU profiler, but just profiling the program as a whole using the basic profiler for your language!</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Papers I’ve read this week: Image generation]]></title><description><![CDATA[A discussion of 4 seminal image generation papers]]></description><link>https://www.artfintel.com/p/papers-ive-read-this-week-image-generation</link><guid isPermaLink="false">https://www.artfintel.com/p/papers-ive-read-this-week-image-generation</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Tue, 11 Apr 2023 16:32:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IK79!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544c5609-bc0f-4a3f-9975-991fe88f3d39_1342x664.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been reading about a lot of image generation models lately, focusing on the OpenAI papers, as they seem like the basis of a lot of future work. Here, I discuss 4 seminal papers:</p><ol><li><p>CLIP</p></li><li><p>DALL-E</p></li><li><p>GLIDE</p></li><li><p>DALL-E 2</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p></li></ol><h1>CLIP</h1><p><a href="https://openai.com/research/clip">Abstract</a></p><p>This isn&#8217;t an image generation paper <strong>per se</strong>, but is used by a lot of them as the embedding scheme. The main idea behind the paper is to predict which caption goes with which image as a way of doing self-supervised learning from a dataset of (image, text) pairs. If this works, then the result will be aligned text and image embeddings, which would be really useful for a variety of applications, e.g. 
search, image generation, zero-shot classification, etc.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IK79!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544c5609-bc0f-4a3f-9975-991fe88f3d39_1342x664.png" alt="The CLIP contrastive pre-training and zero-shot classification setup" width="1342" height="664"></figure></div>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The architecture is pretty simple; they jointly train an image and a text encoder on a batch of N (image, text) pairs using a contrastive loss which predicts which of the N^2 pairings actually occurred. They maximize the cosine similarity of the image &amp; text embeddings of the real pairs, while minimizing the cosine similarity of the N^2 - N incorrect pairs. Pseudo-code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XEY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XEY1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 424w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 848w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 1272w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XEY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png" width="680" height="708" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135072,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XEY1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 424w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 848w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 1272w, https://substackcdn.com/image/fetch/$s_!XEY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bdf6a6f-7927-411c-bbfd-d59b852907a9_680x708.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once the model is trained, they&#8217;re able to use it for a variety of tasks. My favourite is zero-shot classification; they ask the model to assign probabilities to the text &#8220;{class}&#8221; and use that as the probability that the image is a member of the class. 
<p>Interestingly, by using the prompt &#8220;A photo of a {class}.&#8221; instead of the bare class name, they got a 1.3% improvement in accuracy.</p><p>Using this zero-shot classification scheme, the results are quite impressive: they&#8217;re able to match the accuracy of the original ResNet-50 on ImageNet <em><strong>zero-shot</strong></em>, without training on any of the ImageNet dataset.</p><p>They experiment with two separate image models, one based on various ResNet architectures, and one based on ViT. They find that the ViT performs better, and is actually cheaper to train than the ResNet.</p><p>The prompt result is really interesting to me, as &#8220;A photo of a {class}.&#8221; is just a prompt that humans were able to come up with. I&#8217;m curious what the optimal prompt is. An experiment I&#8217;d love to run would be to use some sort of RL-driven approach to explore the token space and find the best prompt for such a classification scheme; I suspect there&#8217;s more performance that can be squeezed out of it.</p><p>In any case, CLIP is remarkable because it&#8217;s a very straightforward embedding method that works quite well, and is able to scale to massive amounts of data. It&#8217;s further evidence that large-scale self-supervised pre-training is the way to go to achieve SOTA performance with neural nets.</p><h1>DALL-E</h1><p><a href="https://arxiv.org/abs/2102.12092">Abstract</a></p><p>This paper uses a transformer to auto-regressively model text &amp; image tokens as a single stream of data, enabling zero-shot image generation.</p><p>For a long time, image generation was dominated by GANs, but they&#8217;re tough to scale. I&#8217;m not sure exactly what the problems are with scaling them; mode collapse is always cited, but I&#8217;m not up to speed with that literature. I&#8217;m going to dive into it for a future &#8220;Papers I&#8217;ve read this week.&#8221;</p><p>DALL-E was the first (afaik) model to massively scale autoregressive transformers on large image datasets. They trained a 12B parameter model on 250M (image, text) pairs using a two-stage training procedure:</p><ol><li><p>They train a discrete VAE (dVAE) to compress each 256x256 RGB image into a 32x32 grid of image tokens, each element of which can assume 8192 possible values.</p></li><li><p>They concatenate up to 256 BPE-encoded text tokens with the 32x32 = 1024 image tokens, for a sequence of at most 1280 tokens.</p></li></ol><p>The concatenated sequence is fed to an autoregressive transformer to model the joint distribution over the text &amp; image tokens; the overall procedure is equivalent to maximizing the standard <a href="https://en.wikipedia.org/wiki/Evidence_lower_bound">ELB</a>.</p><p>Let <em>x</em> denote the images, <em>y</em> the captions, and <em>z</em> the tokens for the encoded RGB image. They model the distribution via</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\theta, \\psi}(x, y, z) = p_\\theta(x | y, z) p_\\psi(y, z)&quot;,&quot;id&quot;:&quot;HMNIHELPFZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>i.e.
they learn two models: p_\theta, the distribution over the RGB images generated by the dVAE decoder given the image tokens, and p_\psi, the joint distribution over the text and image tokens, modelled by the transformer.</p><p>The log-likelihood has the lower bound</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\ln p_{\\theta, \\psi}(x, y) \\geq E_{z \\thicksim q_{\\phi}(z | x)}\\left( \\ln p_\\theta(x | y, z) - \\beta D_{KL}(q_\\phi(y, z | x), p_\\psi(y, z))\\right)&quot;,&quot;id&quot;:&quot;PAADSSPORF&quot;}" data-component-name="LatexBlockToDOM"></div><p>The bound only holds when \beta is 1, but they play with other values, and find it useful to use larger ones.</p><p>I wrote up a <a href="https://finbarr.ca/deriving-the-dall-e-lower-bound/">derivation on my blog</a>, as it was unclear to me where the lower bound came from. As we&#8217;re <em>maximizing</em> the log-likelihood, maximizing its lower bound is fine. q_\phi here is the distribution over the image tokens generated by the dVAE encoder given the RGB image <em>x</em>.</p><p>This paper, like many image generation papers, has to abstract away from pixels:</p><blockquote><p>using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images</p></blockquote><p>We see this in many image generation models, as pixels are really expensive; if you can learn a mapping from text/image tokens into some latent space, you can then learn a separate mapping from the latent space to pixel space, and upgrade each piece separately. This modularity is particularly useful for production systems, as you don&#8217;t have to retrain <em>everything</em> to experiment with your system.</p><p>They first train the dVAE to learn a visual codebook by maximizing the lower bound using the <a href="https://arxiv.org/abs/1611.01144">gumbel-softmax</a> <a href="https://arxiv.org/abs/1611.00712">relaxation</a> (they have to use this as q_\phi is a discrete distribution, so we can&#8217;t use the reparametrization gradient to maximize it).</p><p>Then, they fix \phi and \theta, and learn the prior distribution over the text &amp; image tokens by maximizing the ELB with respect to \psi, using a 12B parameter transformer for p_\psi. The model is fairly standard, using BPE to encode the caption, and getting image tokens from the dVAE encoder logits. The transformer itself is a decoder in which each image token attends to all text tokens.</p><p>Shockingly, given the current state of the art in LLMs, this model only takes up 24GB of memory (12B parameters at 16 bits each), so the parameters could fit on a single A100 (which has 40GB of memory). They only had access to V100s for whatever reason (which only have 16GB of memory), so they had to use <a href="https://arxiv.org/abs/1910.02054">ZeRO</a> to train the model across multiple machines.</p><p>They used <a href="https://arxiv.org/abs/1905.13727">PowerSGD</a>, which I had never heard of before, to dramatically reduce the amount of communication needed between GPUs: they were able to compress the parameter gradients by 85%.</p><p>To sample from the model, they rerank the samples from the transformer using <a href="https://openai.com/research/clip">CLIP</a>, which assigns a score for how well the image matches the caption; they generate N samples and select the top K to show to the user.</p>
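<p>The sampling pipeline is easy to sketch; clip_score below is a stand-in for the real scoring function:</p><pre><code>import numpy as np

def rerank(prompt, samples, clip_score, k=8):
    """Generate-then-filter: keep the k samples CLIP scores highest."""
    scores = np.array([clip_score(prompt, image) for image in samples])
    best = np.argsort(scores)[::-1][:k]  # indices of the top-k scores
    return [samples[i] for i in best]

# Stand-ins: 64 random "images" and a random scoring function.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(256, 256, 3)) for _ in range(64)]
clip_score = lambda prompt, image: float(rng.normal())
print(len(rerank("a corgi playing a trumpet", samples, clip_score)))</code></pre>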
<h1>GLIDE</h1><p><a href="https://arxiv.org/abs/2112.10741">Paper</a></p><p>In GLIDE, a 3.5B parameter diffusion model is used to turn text embeddings into images. They explore two different guidance methods: CLIP guidance, and <a href="https://openreview.net/forum?id=qw8AKxfYbI">classifier-free guidance</a>. The model is quite straightforward, using a vanilla transformer to generate embeddings, and a vanilla diffusion model to output images from the embeddings. The CLIP guidance method is interesting; it&#8217;s described as:</p><figure><a href="https://substackcdn.com/image/fetch/$s_!n6vj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016f1e08-b970-452e-a383-0dd4d7199b25_580x342.png">[Figure: excerpt from the GLIDE paper describing CLIP guidance]</a></figure>
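<p>For reference, the two methods in equations (my paraphrase of the paper, so treat the notation as approximate): CLIP guidance perturbs the reverse-process mean with the gradient of the CLIP score, $\mu_\theta(x_t|c) + s \cdot \Sigma_\theta \nabla_{x_t}(f(x_t) \cdot g(c))$, while classifier-free guidance extrapolates away from the unconditional prediction, $\hat{\epsilon}(x_t|c) = \epsilon_\theta(x_t|\emptyset) + s \cdot (\epsilon_\theta(x_t|c) - \epsilon_\theta(x_t|\emptyset))$, with a guidance scale $s &gt; 1$ trading diversity for caption fidelity.</p>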
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They train two models:</p><ol><li><p>A 1.2B parameter text encoding transformer which takes in the text tokens as input and outputs an embedding</p></li><li><p>A 1.5B parameter upsampling diffusion model, which takes in the embeddings and outputs the final image</p></li></ol><p>this seems like a really logical architecture- I'm surprised it was novel? I need to read more of the preceding literature to understand the context. To evaluate their model, they use a bunch of tasks to evaluate their model, such as in-painting, composition, corgi images, etc. I'm curious how much of this was introduced by them vs introduced by others- I don't have the context to understand that, but the way they evaluate their models seems very similar to how all subsequent models are evaluated.</p><h1>DALL-E 2</h1><p><a href="https://cdn.openai.com/papers/dall-e-2.pdf">Paper</a></p><p>DALL-E 2 uses a two-step training process: first, train CLIP, then, train a text-to-image generation process from it. In the text-to-image generation process, they have two models:</p><ol><li><p>A prior, which takes in the CLIP text embedding, and outputs an image embedding,</p></li><li><p>The decoder, which takes in the CLIP image embedding and outputs the image.</p></li></ol><p>The models are both diffusion models, using</p><p>They explored two different priors, one an autoregressive (AR) prior using a ??? to iteratively build the image embedding from a sequence of discrete codes, and the other, a diffusion prior, which directly models the image embedding. 
It&#8217;s worth noting that both methods involve iteratively building the embedding; the AR prior iteratively builds a sequence which is the embedding, while the diffusion model iteratively <em><strong>refines</strong></em> the embedding.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fSEQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fSEQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 424w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 848w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 1272w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fSEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png" width="976" height="172" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:172,&quot;width&quot;:976,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fSEQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 424w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 848w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 1272w, https://substackcdn.com/image/fetch/$s_!fSEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dccc7b8-c9d0-48a4-9d3c-6a8b7683c7ab_976x172.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Given that the diffusion model is preferred by human evaluators, that seems to do better. 
This makes sense: the diffusion model is able to modify any part of the embedding at each step, while the AR model can only commit to one element at a time. There are interesting implications for language modelling here; maybe Sander Dieleman is right, and <a href="https://sander.ai/2023/01/09/diffusion-language.html">diffusion language models</a> are the next big thing.</p><p>Comparing DALL-E 2 to GLIDE, the main difference is that image generation is split into two steps:</p><ol><li><p>Transforming the text embedding into an image embedding</p></li><li><p>Generating the actual image</p></li></ol><p>Naively, it makes sense to me to combine these two steps and train the whole thing end-to-end, but as DALL-E 2 was a significant improvement over GLIDE, it seems my intuition is wrong.</p><p>In particular, it&#8217;s not clear to me why we need to go from a CLIP text embedding to a CLIP image embedding at all: isn&#8217;t the whole premise of CLIP that the text and image embeddings are in the same latent space (and thus equivalent)? I posted this comment on Twitter, and I got some interesting replies:</p><ul><li><p><a href="https://twitter.com/sharifshameem">Sharif</a> noted that while CLIP should, in theory, be encoding &#8220;both text and images into the same latent space, there&#8217;s still a large gap between the embeddings from each modality&#8221;, pointing me to <a href="https://modalitygap.readthedocs.io/en/latest/">a paper about understanding this modality gap</a>.</p></li><li><p><a href="https://twitter.com/BenTheEgg">Benjamin</a> <a href="https://twitter.com/BenTheEgg/status/1645644757119508480?s=20">hypothesized that there is a loss of spatial information</a> due to the image encoder mapping the (H, W, C) images to (embedding_dimension,) embedding vectors.</p></li></ul><p>To me, this indicates that CLIP isn&#8217;t nearly as good as we would hope, and there&#8217;s significant room to improve it.</p><h1>Future papers</h1><p>I haven&#8217;t read, but plan to read, the following papers, possibly for a &#8220;diffusion models&#8221; version of &#8220;Papers I&#8217;ve read this week&#8221;:</p><ul><li><p><a href="https://arxiv.org/abs/2302.12248">https://arxiv.org/abs/2302.12248</a></p></li><li><p>Generative modelling through SDEs: <a href="https://arxiv.org/abs/2011.13456">https://arxiv.org/abs/2011.13456</a></p></li><li><p>Latent diffusion models: <a href="https://arxiv.org/abs/2112.10752">https://arxiv.org/abs/2112.10752</a></p></li><li><p>Classifier guidance: <a href="https://arxiv.org/abs/2105.05233">https://arxiv.org/abs/2105.05233</a></p></li><li><p>Classifier-free guidance: <a href="https://openreview.net/forum?id=qw8AKxfYbI">https://openreview.net/forum?id=qw8AKxfYbI</a></p></li><li><p>ControlNet: <a href="https://arxiv.org/abs/2302.05543">https://arxiv.org/abs/2302.05543</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Five years of progress in GPTs]]></title><description><![CDATA[A summary of the progression of the SOTA in language models]]></description><link>https://www.artfintel.com/p/five-years-of-progress-in-gpts</link><guid
isPermaLink="false">https://www.artfintel.com/p/five-years-of-progress-in-gpts</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Wed, 29 Mar 2023 18:48:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p5TQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8cb9757-3c59-4b97-bfcb-5ea4be51d21c_1332x566.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A note: Substack doesn&#8217;t do a great job rendering equations, so you might want to read this article on my <a href="https://finbarr.ca/five-years-of-gpt-progress/">blog</a> (but first, <a href="https://finbarrtimbers.substack.com/about">subscribe</a> on Substack to be notified of future articles).</em></p><p><em>I have started to support paid subscriptions. If you have found this newsletter professionally useful, and want to help me spend more time writing, please consider signing up for a paid subscription.</em></p><p>In this article, I discuss the generative pre-trained transformer (GPT) line of work, and how it has evolved over time. I focus on the SOTA models, and the differences between them. There are a bunch of different articles summarizing these papers, but nothing that I&#8217;m aware of that explicitly focuses on the differences between them. <strong>Until now.</strong></p><p>I focus on the GPT line of research as that&#8217;s what&#8217;s driving the current fever pitch of development. There&#8217;s a ton of prior work before large GPTs (eg the <a href="https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html">n-gram models</a> from the 2000s, <a href="https://arxiv.org/abs/1810.04805">BERT</a>, etc), and after (e.g. <a href="https://github.com/BlinkDL/RWKV-LM">RWKV</a>) but this post is super long, so I&#8217;m gonna save those for future articles.</p><p>I also don&#8217;t go into any detail about <a href="https://huggingface.co/blog/rlhf">RLHF</a> or other finetuning methods. I&#8217;m planning to write about that in the future. Those techniques are critical to the performance of the deployed LLM systems like ChatGPT, Claude, LaMDA etc., so they&#8217;re worth understanding. I don&#8217;t discuss any of the dialog specific systems, as I want to focus on the most general, pure language modelling transformers.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p><h2>GPT</h2><p><a href="https://openai.com/research/language-unsupervised">Abstract</a></p><p>The first GPT paper is interesting to read with hindsight. It doesn&#8217;t appear like anything special and doesn&#8217;t follow any of the conventions that have developed. The dataset is described in terms of GB rather than tokens, and the number of parameters in the model isn&#8217;t explicitly stated. 
To a certain extent, I suspect that the paper was a side project at OpenAI and wasn&#8217;t viewed as particularly important; there are only 4 authors, and I don&#8217;t remember it particularly standing out at the time.</p><p>The architecture is remarkably unchanged compared to GPT-3:</p><ul><li><p>Decoder-only transformer, with 12 layers, 768 embedding dimensions, 12 attention heads, and a 3072-dimensional MLP hidden layer (4x the embedding dimension).</p></li><li><p>They use Adam, with a warm-up, and anneal to 0 using a cosine schedule.</p></li><li><p>Weights are initialized to N(0, 0.02), using BPE with a vocabulary of 40k merges.</p></li><li><p>Activations are GELUs.</p></li><li><p>Context of 512 tokens.</p></li><li><p>117M parameters.</p></li><li><p>Learned position embeddings, not the sinusoidal ones from <a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>.</p></li></ul><p>The number of parameters isn&#8217;t explicitly discussed in the paper, but works out to roughly 120M, easily small enough to fit on a single V100 or a standard consumer GPU (rough estimate: 120M parameters for the model plus 240M for the Adam optimizer state gives 360M values; assuming each is a float32, this takes up 4 bytes * 360M = 1440MB, or about 1.4GB; I work this arithmetic into a short script at the end of this section).</p><p>They use the <a href="https://huggingface.co/datasets/bookcorpus">BooksCorpus</a> dataset (1B tokens<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, 4GB), training for 100 epochs with a batch size of 64. 1B tokens is a small dataset by modern standards, as is a batch size of 64.</p><p>The most surprising thing compared to modern GPTs is that they train for 100 epochs. Modern GPTs rarely ever see repeated data, and when they do, they typically only see certain datapoints a small number of times (2-4x); the entire dataset is never repeated 100 times.</p>
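<p>Since this kind of back-of-envelope memory arithmetic comes up for every model in this article, here it is as a script. It assumes vanilla Adam (two extra states per parameter), everything in float32, and ignores gradients and activations, matching the rough estimate above:</p><pre><code>def training_memory_gb(n_params, bytes_per_value=4, optimizer_states=2):
    """Parameters plus Adam's two moment buffers, all float32."""
    n_values = n_params * (1 + optimizer_states)
    return n_values * bytes_per_value / 1e9

print(training_memory_gb(120e6))  # GPT:   ~1.4 GB
print(training_memory_gb(1.5e9))  # GPT-2: ~18 GB
print(training_memory_gb(175e9))  # GPT-3: ~2100 GB</code></pre>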
<h2>GPT-2</h2><p><a href="https://openai.com/research/better-language-models">Abstract</a></p><p>GPT-2 is where the language models start to get big. This is the first time that OpenAI trains a model with &gt;1B parameters. We start to see scale as a primary concern; in GPT, the authors trained a single model, but here, the authors train a range of models, with sizes ranging from GPT-sized to 10x GPT (the 10x model being the actual GPT-2).</p><p>The differences in architecture compared to GPT are as follows:</p><ul><li><p>They layernorm the inputs to each sub-block and add an additional layernorm after the final self-attention block</p></li><li><p>Residual weights are scaled at initialization by 1/sqrt(N), where N is the number of residual layers</p></li><li><p>Vocabulary of ~50k (up from ~40k)</p></li><li><p>Context of 1024 (up from 512)</p></li><li><p>Batches of 512 (up from 64)</p></li><li><p>Largest model is 1.5B parameters</p></li></ul><p>The dataset is much, much bigger, going from 4GB of data consisting of publicly available books, to 40GB (or 9B tokens)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> of text scraped from the internet (<a href="https://paperswithcode.com/dataset/webtext">WebText</a>).</p><p>It&#8217;s unclear if they trained the model for 100 epochs as before; they say they followed the same training procedure, so presumably they did. Again, this is a significant departure from later work.</p><p>Nothing here is particularly different from GPT; most of the changes are related to making the model bigger. The only other changes are the layernorm tweaks and the weight scaling, which don&#8217;t seem to make a big difference (although, as always, more ablations would be nice).</p><h1>Kaplan et al.</h1><p><a href="https://arxiv.org/abs/2001.08361">Abstract</a></p><p>I feel like there has to be a better name to refer to this paper, but I can&#8217;t find one, so I just call it Kaplan et al. This was one of the first (maybe the first?) scaling-law papers for LLMs. In it, the authors train a large number of GPT-style models to make empirical predictions for how model performance varies with scale. This paper was highly influential, as it formed the basis for GPT-3, justifying the scale-up to 175B parameters (a hitherto unseen level of scale).</p><figure><a href="https://substackcdn.com/image/fetch/$s_!p5TQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8cb9757-3c59-4b97-bfcb-5ea4be51d21c_1332x566.png">[Figure: scaling-law plots from Kaplan et al.]</a></figure>
<p>This paper is notable as it did <strong>real science</strong>, running a number of experiments and making predictions as to how models should scale. It stands up very well.</p><p>Some notable results from the paper:</p><ul><li><p>They found that model performance (in terms of test loss) relies heavily on the number of parameters and the number of tokens trained on, with the exact model architecture having very little impact.</p></li><li><p>Model performance follows a power law in each of N (the number of parameters in the model), D (the number of tokens trained on), and C (the amount of compute used for training). If any of these is held fixed, performance rapidly hits diminishing returns.</p></li><li><p>Large models are more sample-efficient than smaller models. This is, to a certain extent, foreshadowing for Chinchilla (which I will discuss later).</p></li><li><p>It is possible to determine the optimal batch size by measuring the gradient noise scale, following <a href="https://arxiv.org/abs/1812.06162">McCandlish et al.</a> This is not novel, but it is important to keep in mind, as many practitioners determine batch size empirically, when it is possible to calculate it directly.</p></li></ul><p>This paper was, until Chinchilla came out, the gold standard for how to train large language models.</p>
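<p>To make the power-law claim concrete, the single-variable fits in the paper take the form $L(N) = (N_c/N)^{\alpha_N}$, $L(D) = (D_c/D)^{\alpha_D}$, and $L(C_{min}) = (C_c/C_{min})^{\alpha_C}$, with fitted exponents of roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.05$ (quoting from memory, so treat the constants as approximate). The tiny exponents are the point: you need order-of-magnitude increases in N, D, or C to move the loss appreciably.</p>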
<h2>GPT-3</h2><p><a href="https://arxiv.org/abs/2005.14165">Abstract</a></p><p>Here is where the era of truly <em><strong>large</strong></em> language models began, and the current AI <s>bubble</s> excitement took off. In the paper, the authors train 10 models, varying from 125M parameters (&#8220;GPT-3 Small&#8221;) to 175B parameters (&#8220;GPT-3&#8221;).</p><figure><a href="https://substackcdn.com/image/fetch/$s_!iCjQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F087e892a-715f-42ca-a7b1-7cc9ff04d346_1132x370.png">[Figure: the table of GPT-3 model sizes and hyperparameters]</a></figure>
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For each of the models, the architectures are identical to GPT-2 with the exception that they use &#8220;alternating dense and locally banded sparse attention patterns in the layers of the transformer.&#8221; The sparse attention here refers to the attention mechanism introduced in the <a href="https://paperswithcode.com/method/sparse-transformer">Sparse Transformer</a>, which lets attention scale proportional to O(n &#8730;n) (where n is the context length). The standard dot-product attention mechanism scales proportional to O(n^2), so this is a substantial gain. I would have loved a proper ablation to see what difference sparse vs dense attention makes, but alas.</p><p>I&#8217;m very curious <em><strong>why</strong></em> they used sparse attention. Reproductions and later papers uniquely use dense attention. As this paper came before <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a> and some of the other algorithmic innovations that make dense attention faster, maybe this was a computational bottleneck? It&#8217;s really unclear.</p><p>They don&#8217;t provide any detail about the computational architecture, i.e. how they distributed the model. The authors claim it&#8217;s because it doesn&#8217;t really matter, but I think it was restricted for competitive reasons, as it makes the paper much more difficult to reproduce. Megatron, which I&#8217;ll discuss later, was highly influential <em><strong>because</strong></em> they went into detail about how they made model parallelism work for their GPT.</p><p>What I find <strong>really interesting</strong> about the GPT-3 paper is that it was an incredible advance without a lot of novelty. They took their existing methods and &#8220;just&#8221; scaled it up! Because of the need for novelty, there are many research projects that don&#8217;t get pursued because they&#8217;re &#8220;only&#8221; engineering projects, or they &#8220;only&#8221; do hyper-parameter tuning and wouldn&#8217;t be able to get published, even if they had impressive performance improvements. 
<p>They don&#8217;t provide any detail about the computational architecture, i.e. how they distributed the model. The authors claim it&#8217;s because it doesn&#8217;t really matter, but I think it was withheld for competitive reasons, as it makes the paper much more difficult to reproduce. Megatron, which I&#8217;ll discuss later, was highly influential <em><strong>because</strong></em> they went into detail about how they made model parallelism work for their GPT.</p><p>What I find <strong>really interesting</strong> about the GPT-3 paper is that it was an incredible advance without a lot of novelty. They took their existing methods and &#8220;just&#8221; scaled them up! Because of the need for novelty, there are many research projects that don&#8217;t get pursued because they&#8217;re &#8220;only&#8221; engineering projects, or they &#8220;only&#8221; do hyper-parameter tuning and wouldn&#8217;t be able to get published, even if they had impressive performance improvements. That OpenAI went against the grain here is a credit to them (and they were rewarded, with GPT-3 getting a <a href="https://neuripsconf.medium.com/announcing-the-neurips-2020-award-recipients-73e4d3101537">best paper award</a> at NeurIPS &#8216;20).</p><p>This is a strength of OpenAI (and <a href="http://Stability.ai">Stability.ai</a>, Midjourney, basically everywhere that&#8217;s not FAIR/Google Brain/DeepMind/etc). You could alternatively frame it as a weakness of the more academic labs, whose promotion and performance-review policies are driven by publications.</p><h1>Jurassic-1</h1><p><a href="https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf">PDF</a></p><p>I wasn&#8217;t sure whether or not to include Jurassic-1. It&#8217;s a model from the Israeli tech company AI21 Labs. I haven&#8217;t heard a lot about them, but the paper is cited by a bunch of the papers later on in this article; they trained a 178B parameter model that outperformed GPT-3 in a few categories, and was faster for inference. It&#8217;s impressive that they&#8217;re competing with DeepMind, OpenAI, Nvidia, etc. despite only having <a href="https://en.wikipedia.org/wiki/AI21_Labs">raised &lt;$10M</a> at the time. They also made a zero-shot and few-shot test suite <a href="https://github.com/ai21labs/lm-evaluation">publicly available</a>.</p><p>Like many other papers, they don&#8217;t go into detail about the engineering work behind training a large model (178B parameters) over 800 GPUs:</p><figure><a href="https://substackcdn.com/image/fetch/$s_!7SvS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png">[Figure: excerpt from the Jurassic-1 paper on their training setup]</a></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66315,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7SvS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png 424w, https://substackcdn.com/image/fetch/$s_!7SvS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png 848w, https://substackcdn.com/image/fetch/$s_!7SvS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png 1272w, https://substackcdn.com/image/fetch/$s_!7SvS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7ec951-a064-44cb-a370-7b171be4ba5a_1690x160.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The paper is remarkably sparse on details, which I suspect was done for competitive reasons, just like GPT-4.</p><p>Facebook is the only company to go into <a href="https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf">detail about their experiences</a> training a 175B parameter model, just like Nvidia is the only company to go into detail about the computational architecture required to train a LLM over many GPUs (see: the Megatron paper, next). In both cases, the companies are <a href="https://gwern.net/complement">commoditizing their complements</a> and strengthening their main lines of business by making it easier to train large models.</p><p>Jurassic uses a different architecture from GPT-3, but again, doesn&#8217;t go into much detail:</p><ul><li><p>76 layers (vs 96 layers for GPT-3)</p></li><li><p>They use the SentencePiece tokenizer, with a large vocabulary of 256K (vs GPT-3 which used BPE w/ ~50k tokens).</p></li></ul><p>Neither of these changes are material, in my opinion. I think what we&#8217;re seeing is that there&#8217;s a relatively large degree of freedom in model architectures which produce similar results. This is borne out by their evaluation, which has results similar to GPT-3 (better in some categories, worse in others), although Jurassic-1 is faster for inference due to being shallower.</p><p>We&#8217;re starting to see a consistent pattern emerge:</p><ul><li><p>Papers introduce a bunch of changes, their own dataset, and have a new SOTA</p></li><li><p>but they don&#8217;t do a proper ablation, so it&#8217;s tough to understand what was important and what <em>drove</em> the improvements</p></li></ul><p>GPT-2, GPT-3, Jurassic-1, etc. all did this.</p><h1>Megatron-Turing NLG</h1><p>Megatron was a highly influential paper that introduced efficient model-parallel architectures. 
If you&#8217;re interviewing for a LLM job today, you&#8217;re going to be expected to be familiar with it. Megatron introduced <strong>tensor parallelism</strong>, a variant of model parallelism that splits the models to allow for intra-layer model parallelism, achieving 76% as efficient as a single GPU baseline (although the baseline is only 30% of peak FLOPS).</p><p>Prior to Megatron, the published SOTA for model parallelism was to use model pipelining, e.g. <a href="https://arxiv.org/abs/1811.06965">GPipe</a>. However, this was difficult to do and not well supported by code. There were attempts to support tensor parallelism, e.g. <a href="https://paperswithcode.com/method/mesh-tensorflow">Mesh-Tensorflow</a>, which introduced a language for specifying a general class of distributed computations in TensorFlow, but nothing had really dominated. Interestingly, the first author had just left DeepMind 1 year before this was published, so this was possibly his first project at Nvidia.</p><p>Megatron has the realization that, if you have a neural network like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Y = f(XW)&quot;,&quot;id&quot;:&quot;QNDUPMDAEW&quot;}" data-component-name="LatexBlockToDOM"></div><p>and you split </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W = \\begin{bmatrix} W_1 &amp; W_2 \\end{bmatrix}&quot;,&quot;id&quot;:&quot;FDJYFYIZEG&quot;}" data-component-name="LatexBlockToDOM"></div><p>i.e. along the columns, then</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Y = \\begin{bmatrix}f(X W_1) &amp; f(X W_2)\\end{bmatrix}&quot;,&quot;id&quot;:&quot;RRBKHOYFGG&quot;}" data-component-name="LatexBlockToDOM"></div><p>so you don&#8217;t need to do any synchronization to calculate Y. 
<p>Consequently, the only points where you need synchronization (all-reduces) in the transformer are:</p><ol><li><p>In the forward pass, to combine the model activations after the MLP block, before adding dropout</p></li><li><p>In the backwards pass, at the start of the self-attention block.</p><figure><a href="https://substackcdn.com/image/fetch/$s_!JKAc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ece1271-c55a-437c-974a-2010b9bae222_1100x1268.png">[Figure: Megatron&#8217;s diagram of the parallelized MLP and self-attention blocks]</a></figure></li></ol>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ol><p>Now, I strongly suspect this is what GPT-3 and Jurassic-1 both did, but neither went into detail about the specific parallelism models they used, other than to say (from GPT-3):</p><blockquote><p>To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network.</p></blockquote><p>Presumably, this style of parallelism is what is meant by &#8220;model parallelism within each matrix multiply,&#8221; as I find it hard to imagine what else they could mean.</p><h2>Gopher</h2><p><a href="https://arxiv.org/abs/2112.11446">Abstract</a></p><p>Gopher was a LLM trained by DeepMind. Interestingly, the lead author joined OpenAI shortly after it was published, along with a few of the coauthors. The architecture was the same as GPT-2, except:</p><ul><li><p>They use <a href="https://arxiv.org/abs/1910.07467">RMSNorm</a> (instead of layernorm)</p></li><li><p>Use relative positional encoding scheme from <a href="https://arxiv.org/abs/1901.02860">Transformer-XL</a> (while GPT-* used a learned positional embedding)</p></li><li><p>They use <a href="https://arxiv.org/abs/1808.06226">SentencePiece</a> (instead of <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">BPE</a>). 
This seems to be an Alphabet thing; many of the Alphabet papers use SentencePiece, while most of the non-Alphabet world uses BPE.</p></li></ul><p>The paper was very interesting from a computational perspective, as they went into detail about how they trained their model and made it work:</p><ul><li><p>They used optimizer state partitioning (<a href="https://arxiv.org/abs/1910.02054">ZeRO</a>)</p></li><li><p><a href="https://www.notion.so/b2293a2b125b4b088656e039fb3b6ca8">Megatron-style</a> model parallelism</p></li><li><p>And <a href="https://paperswithcode.com/method/gradient-checkpointing">rematerialization/gradient checkpointing</a> to save memory.</p></li></ul><p>These are all now the standard techniques used to train large models. To the best of my knowledge, Gopher was the first paper to put all of these together and release details about doing so publicly.</p><p>It&#8217;s interesting: often, big labs don&#8217;t include details for competitive reasons. Here, because DeepMind was (arguably) behind, they went into extensive detail. I think we&#8217;ll see this increase with LLM research from everyone that&#8217;s not OpenAI/Anthropic, as the others don&#8217;t live or die by the commercial success of their API, and have strong incentives to make it easier for <strong>others</strong> to train large models (and thereby <a href="https://gwern.net/complement">commoditize their complements</a>).</p><p>For the paper, DeepMind built a dataset called MassiveText, which breaks down as follows:</p><figure><a href="https://substackcdn.com/image/fetch/$s_!ZqWe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png">[Figure: the MassiveText composition table from the Gopher paper]</a></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:422,&quot;width&quot;:1148,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93270,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZqWe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png 424w, https://substackcdn.com/image/fetch/$s_!ZqWe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png 848w, https://substackcdn.com/image/fetch/$s_!ZqWe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png 1272w, https://substackcdn.com/image/fetch/$s_!ZqWe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22f13ec-e4ef-4919-8a2c-b3c27c0a4d36_1148x422.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Interestingly, this is much smaller than the dataset OpenAI used for GPT-3. GPT-3 had roughly 45TB of text, while MassiveText &#8220;only&#8221; had about 10.5TB.</p><p>They used this dataset to train a large model on 300B tokens. The dataset consists of 2.343 trillion tokens, so this is only 12.8%. A much smaller subset. 
This is interesting to compare to the earlier GPTs, which, if you recall, used 100 epochs (so they saw each token in the dataset 100 times), while Gopher saw only ~13% of its tokens, once!</p><p>The Gopher appendices have some great work; someone finally did ablations! They looked at:</p><ul><li><p><a href="https://paperswithcode.com/method/adafactor">Adafactor</a> vs Adam, and found that Adafactor was much less stable</p></li><li><p>Lower-precision training, trying runs with float16, bfloat16, float32, <a href="https://nhigham.com/2020/07/07/what-is-stochastic-rounding/">RandRound</a>, and using bfloat16 parameters with float32 in the optimiser state (rounding randomly). They found that using float32 parameters only for the optimisation updates mitigated the performance loss while still saving a substantial amount of memory.</p></li><li><p>Scaling context length; they show how performance improves as the context length increases. The improvements see diminishing returns, but are consistent&#8212; performance looks roughly proportionate to $\sqrt{n}$ (where $n$ is the context length).</p></li></ul><p>It&#8217;s really nice to see detailed empirical work like this&#8212; it&#8217;s a welcome change from the other papers, which failed to do it.</p><h2>Chinchilla</h2><p><a href="https://arxiv.org/abs/2203.15556">Abstract</a></p><p>Chinchilla is an incredibly influential paper that established scaling laws. It&#8217;s one of my favorite papers from the last few years, as it <em>actually does science</em> in a way that physicists would agree with. One test of &#8220;is something science&#8221; is: if you were to meet a historical scientist in your field, could you teach them something? If you brought Chinchilla to, say, Radford et al. in 2017, it would advance their work by several years.</p><p>Chinchilla trained over 400 GPT-style transformers, ranging in size from 70M to 16B parameters, and fit the following equation (N is the number of parameters in the LM, and D is the number of tokens in the dataset):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{L}(N, D) = E + \\dfrac{A}{N^\\alpha} + \\dfrac{B}{D^\\beta}&quot;,&quot;id&quot;:&quot;MSDTPGFGTB&quot;}" data-component-name="LatexBlockToDOM"></div><p>choosing A, B, E, &#945;, &#946; to minimize</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum \\limits_{\\text{Runs } i} \\text{Huber}_{\\delta=10^{-3}}(\\log \\hat{L}_i - \\log L_i)&quot;,&quot;id&quot;:&quot;YATUBOYYTK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, we can think of E as the &#8220;irreducible loss&#8221; of the dataset, i.e. the loss we&#8217;d get if we trained an infinitely large model on an infinite stream of tokens.</p>
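<p>The fit is easy to play with. Here&#8217;s a quick sketch (mine) evaluating the curve with the coefficients reported in the paper; plugging in Chinchilla itself recovers its ~1.94 training loss:</p><pre><code class="language-python"># Chinchilla's fitted loss curve, with the coefficients reported in
# Hoffmann et al. (2022): E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28.
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94
</code></pre>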
<p>The authors find that the optimal model is (from <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications#fnjefvfidovdb">nostalgebraist</a>&#8217;s post on the implications of Chinchilla):</p><figure><img src="https://substackcdn.com/image/fetch/$s_!nAKZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e00e6c1-1b25-4a02-9dc8-ba91a56befba_686x216.png" alt="The Chinchilla-optimal model size and token count"></figure><p>The implication here is that model size &amp; data size matter roughly equally, which is interesting, given how much attention &amp; effort goes to scaling up the model, and how little attention is given to the dataset.</p>
much attention &amp; effort goes to scaling up the model, and how little attention is given to the dataset.</p><p>The authors then used this equation to determine the optimal model size for the Gopher compute budget, and trained it on more tokens&#8212; 1.4T tokens, 4.6x the number of tokens Gopher was trained on. This model, being 4x smaller, has a radically smaller memory footprint and is much faster/cheaper to sample from.</p><p>The Chinchilla paper has been highly influential. Almost every team that I&#8217;ve been talking to that is training a LLM right now talks about how they&#8217;re training a <em>Chinchilla optimal model</em>, which is remarkable given that basically everything in the LLM space changes every week.</p><p>The standard practice before Chinchilla was to train your model for 300B tokens, which is what GPT-3, Gopher, and Jurassic-1 all did. Chinchilla reveals how wasteful that was; basically, all of these papers made themselves more expensive to infer by training models that were too large.</p><p>Changes from Chinchilla (otherwise the same as Gopher):</p><ul><li><p><a href="https://www.notion.so/Factual-accuracy-issues-in-LLMs-8257c59fda4040509a40d94523701fad">AdamW</a> instead of Adam (there&#8217;s an interesting footnote regarding the choice of optimizer: &#8220;a model trained with AdamW only passes the training performance of a model trained with Adam around 80% of the way through the cosine cycle, though the ending performance is notably better&#8221;)</p></li><li><p>Uses a modified <a href="https://arxiv.org/abs/1808.06226">SentencePiece</a> tokenizer that is slightly different from Gopher (doesn&#8217;t apply NFKC normalisation)</p></li><li><p>They compute the forward + backward pass in bfloat16, but store a float32 copy of the weights in the optimizer state. They find that this is basically identically efficient to using float32 everywhere.</p></li></ul><p>All of the changes are ablated extensively in the appendix (<em>finally</em>).</p><h2>PaLM</h2><p>Speaking of training models that were too large- we have PaLM! Palm was really, really big. <a href="https://twitter.com/finbarrtimbers/status/1635102571567407105?s=46&amp;t=_LCsoamG7K4pQj0vxlk0XA">As far as I&#8217;m aware</a>, it&#8217;s the largest dense language model trained to date, at 540B parameters, requiring 6144 TPUs to train on (this is 3 entire TPU pods, each consisting of 2048 TPUs). This is incredibly expensive! Probably only Google has the resources + infrastructure to do this.</p><p>&#8230; unfortunately, they were training PaLM at the same time chinchilla was being written. 
<blockquote><p>PaLM is a really interesting decoder-style language model that I initially kind of ignored when it was published last year: <a href="https://arxiv.org/abs/2204.02311">arxiv.org/abs/2204.02311</a>. Turns out PaLM has 7 interesting architecture improvements over GPT. &#8212;<a href="https://twitter.com/rasbt/status/1637803700944093184">Sebastian Raschka</a></p></blockquote><p>Changes from GPT-3:</p><ul><li><p><a href="https://arxiv.org/abs/1911.02150">Multi-query attention</a>. Shares the K/V embeddings across all heads, but keeps separate Q embeddings per head. This makes inference much faster (see the sketch after this list).</p></li><li><p>Uses <a href="https://twitter.com/rasbt/status/1637803703766769669">parallel transformer blocks</a>, which improves training time by 15%. As it was trained using 6144 TPU v4 chips for 1200 hours, at public prices of $1.45 to $3.22 per chip-hour the total training cost is roughly $11M to $24M. So this change saved $1.6M to $3.6M.</p></li><li><p>SwiGLU activations, rather than the GELU activation used by GPT-3</p></li><li><p>Uses <a href="https://arxiv.org/abs/2104.09864">rotary positional embeddings</a> (RoPE) instead of the learned embeddings (I&#8217;m sad that the learned embeddings GPT-3 used are suboptimal&#8212; they&#8217;re so elegant).</p></li><li><p>Shares the input-output embeddings</p></li><li><p>No bias vectors</p></li><li><p>SentencePiece with 256k tokens</p></li></ul><p>So, a ton of changes! Again, a bunch of these have become standard, e.g. using the learned embeddings that GPT-3 had is very pass&#233;, and almost no one does it now.</p>
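<p>For intuition on multi-query attention, here&#8217;s a minimal PyTorch sketch (mine, with toy dimensions, not PaLM&#8217;s code). The point is that K and V are projected once and broadcast across all query heads, so the KV cache shrinks by a factor of n_heads:</p><pre><code class="language-python">import torch

# Toy shapes: d_model=512, n_heads=8, d_head=64.
B, T, d_model, n_heads, d_head = 2, 16, 512, 8, 64
x = torch.randn(B, T, d_model)

w_q = torch.nn.Linear(d_model, n_heads * d_head)  # per-head queries
w_k = torch.nn.Linear(d_model, d_head)            # one shared key head
w_v = torch.nn.Linear(d_model, d_head)            # one shared value head

q = w_q(x).view(B, T, n_heads, d_head).transpose(1, 2)  # [B, H, T, d]
k = w_k(x).unsqueeze(1)  # [B, 1, T, d]: broadcast over all H query heads
v = w_v(x).unsqueeze(1)  # only k and v need caching, so the cache is H times smaller

att = (q @ k.transpose(-2, -1)) / d_head**0.5            # [B, H, T, T]
out = att.softmax(-1) @ v                                # [B, H, T, d]
</code></pre>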
<h2>LLaMa</h2><p><a href="https://ai.facebook.com/blog/large-language-model-llama-meta-ai/">Abstract</a></p><p>LLaMa combined a bunch of the best features from PaLM and Chinchilla:</p><ul><li><p>Pre-normalizes the input of each transformer sub-layer</p></li><li><p>RMSNorm, instead of LayerNorm, as done in Gopher</p></li><li><p>SwiGLU activation function from PaLM (but with a dimension of 2/3 &#183; 4d instead of 4d, as in PaLM)</p></li><li><p>Uses RoPE, as PaLM did</p></li><li><p>Uses AdamW, as done in Chinchilla</p></li></ul><p>I think that LLaMa is the recipe to follow for the current SOTA in training large models.</p><p>Computational changes:</p><ul><li><p>Uses efficient attention (<a href="https://arxiv.org/abs/2112.05682">Rabe &amp; Staats</a>, <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>)</p></li><li><p><a href="https://arxiv.org/abs/2205.05198">Gradient checkpointing</a></p></li><li><p>Interestingly, they appear to be using float32s everywhere (or at least, they don&#8217;t say otherwise)</p></li></ul><p>These are all similar to Gopher. The one obvious optimization they missed is using lower precision, as Chinchilla did; I&#8217;m curious why they didn&#8217;t.</p><p>My one complaint is that I wish they had trained the model for longer&#8212; the learning curve is very far from convergence! This paper is, in my mind, the shining example of how well smaller models can do when trained well.</p><p><a href="https://finbarr.ca/llms-not-trained-enough/">As I&#8217;ve written about elsewhere</a>, while Chinchilla is great, it assesses optimality in a very narrow sense: &#8220;With a given compute budget, and ignoring inference costs, how do we choose between the number of parameters of our model and the number of tokens we train on?&#8221; It can make sense to train a model that&#8217;s smaller than Chinchilla optimal and train it for <strong>longer</strong> than Chinchilla would tell us, because if we&#8217;re going to deploy the model at mass scale, we care <em><strong>much</strong></em> more about inference cost than training cost.</p><h1>GPT-4</h1><p>This is where I&#8217;d include information about GPT-4, if there were any. Unfortunately, the <a href="https://cdn.openai.com/papers/gpt-4.pdf">GPT-4 technical report</a> contains almost no information:</p><blockquote><p>GPT-4 is a Transformer-style model [33] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [34]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.</p></blockquote><p>As a result, I&#8217;m not going to talk about it, as there&#8217;s not much to say. Hopefully OpenAI changes their mind and releases some information about their model.</p><h1>Conclusion</h1><p>This is it, as of March &#8216;23. I&#8217;m sure something new will come along and invalidate all of this.</p><p>I haven&#8217;t talked about RLHF/finetuning at all. I plan to write a future article about the various GPT variants that exist (ChatGPT, InstructGPT, WebGPT, etc.), and about how RLHF/finetuning have evolved.</p><p>What have I missed? Comment below and I&#8217;ll update this post.</p><p>Articles I&#8217;m reading:</p><ul><li><p><a href="https://rootnodes.substack.com/p/why-didnt-deepmind-build-gpt3">Why didn&#8217;t DeepMind build GPT-3?</a></p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I <a href="https://gist.github.com/finbarrtimbers/1728037381d27ebc7b4cdd828a6f1f9a">calculated this directly</a> by downloading <a href="https://github.com/soskek/bookcorpus/issues/27">bookcorpus</a> and running <a href="https://github.com/openai/tiktoken">tiktoken</a> with the GPT-2 encoding on it.</p></div></div>
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The paper itself doesn&#8217;t report the number of tokens, but <a href="https://skylion007.github.io/OpenWebTextCorpus/">OpenWebText</a>, the open source reproduction, gets <a href="https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/readme.md">nine billion</a>, using <a href="https://github.com/openai/tiktoken">tiktoken</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How is LLaMa.cpp possible?]]></title><description><![CDATA[An exercise in applied inference arithmetic]]></description><link>https://www.artfintel.com/p/how-is-llamacpp-possible</link><guid isPermaLink="false">https://www.artfintel.com/p/how-is-llamacpp-possible</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Thu, 16 Mar 2023 17:10:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y1D-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11b15c1c-34b8-46d9-bda3-383c36b37e66_1278x340.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>How is LLaMa.cpp possible?</h1><p><em>Note: Substack doesn&#8217;t have great support for LaTeX, so you might want to read this article on my <a href="https://finbarr.ca/how-is-llama-cpp-possible/">blog</a> instead.</em></p><p>Recently, a <a href="https://github.com/ggerganov/llama.cpp">project</a> rewrote the <a href="https://github.com/facebookresearch/llama">LLaMa inference code</a> in raw C++. With some optimizations and by quantizing the weights, the project allows running LLaMa locally on a wild variety of hardware:</p><ul><li><p>On a <a href="https://twitter.com/rgerganov/status/1635604465603473408">Pixel 5</a>, you can run the 7B parameter model at 1 token/s.</p></li><li><p>On an <a href="https://simonwillison.net/2023/Mar/11/llama/">M2 MacBook Pro</a>, you can get ~16 tokens/s with the 7B parameter model</p></li><li><p>You can <a href="https://twitter.com/miolini/status/1634982361757790209">even run the 7B model on a 4GB RAM Raspberry Pi</a>, albeit at 0.1 tokens/s.</p></li></ul><p>If you are like me, you saw this and thought: What? How is this possible? Don&#8217;t large models require expensive GPUs? I took my confusion and dove into the math surrounding inference requirements to understand the constraints we&#8217;re dealing with.</p><p>Let&#8217;s start with GPUs. GPUs have two main benefits for deep learning:</p><ol><li><p>They have a large amount of memory bandwidth (A100: 1935 GB/s, 4090: 1008 GB/s)</p></li><li><p>They have a large amount of compute (A100: 312 TFLOPS of FP16, 4090: 82.6 TFLOPS of FP16)</p></li></ol><p>When we talk about memory bandwidth, we&#8217;re talking about how long it takes to move things from the HBM memory (i.e. the RAM) into the on-chip memory. To actually do math with the GPU, we need to move the matrices in question into the on-chip memory, which is quite small (40MB on an A100, compared to 40-80GB of RAM).
The memory bandwidth is ~2 orders of magnitude smaller than the compute throughput&#8212; this will matter later, as the memory bandwidth tends to be the bottleneck for inference.</p><p>What does this mean in the context of serving LLaMa? Let&#8217;s start with some <a href="https://kipp.ly/blog/transformer-inference-arithmetic/">inference arithmetic</a>. We can do some rough calculations on the inference performance of a LLM using <a href="https://kipp.ly/blog/transformer-param-count/">Kipply&#8217;s article</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> First, some notation on the dimensions of the model:</p><ul><li><p>The Q, K, and V weight matrices are all of shape [d_model, d_head], and we have n_heads of them per layer; the attention output matrix has the same shape, for a total of 4 * [d_model, n_heads * d_head]. By convention, GPT-style networks have d_head * n_heads = d_model.</p></li><li><p>The MLP has two weight matrices, of shape [d_model, 4 * d_model] and [4 * d_model, d_model]</p></li><li><p>The embedding matrix is of size [n_vocab, d_model].</p></li></ul><p>This gives us a handy equation for the number of parameters in a GPT-style model:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P =  n_{\\text{blocks}} \\left( 4 \\cdot d_{\\text{model}}^2 + 2  \\cdot 4  \\cdot d_{\\text{model}}^2\\right) + n_{\\text{vocab}} \\cdot d_{\\text{model}} &quot;,&quot;id&quot;:&quot;WMPCFLNVZL&quot;}" data-component-name="LatexBlockToDOM"></div><p>For the duration of the post, I&#8217;m going to focus on the case where we&#8217;re running a ChatGPT style service locally, which is what LLaMa.cpp does, letting me assume a batch size of 1.</p><p>For efficient inference, the KV cache has to be stored in memory; the KV cache requires storing the KV values for every layer, which is equal to storing, per token and per layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;n_{\\text{bytes}} \\cdot 2 \\cdot d_{\\text{model}}&quot;,&quot;id&quot;:&quot;QGOJGEWTIY&quot;}" data-component-name="LatexBlockToDOM"></div><p>I use n_bytes here to indicate the number of bytes per param; for float32s, this is 4, for float16s, this is 2, etc. The 2 in the middle is because we have to store one set of values for the Ks, and one for the Vs.
</p><p>Given a model with n_blocks layers, the total KV cache memory per token of context is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;n_{\\text{blocks}} \\cdot n_{\\text{bytes}} \\cdot 2 \\cdot d_{\\text{model}}&quot;,&quot;id&quot;:&quot;WTHXWJJKSD&quot;}" data-component-name="LatexBlockToDOM"></div><p>In addition to storing the KV cache in memory, we also need to store the weights themselves; this requires n_bytes * P bytes.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!y1D-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11b15c1c-34b8-46d9-bda3-383c36b37e66_1278x340.png" alt="Memory required for the LLaMa models at different precisions"></figure>
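<p>Here&#8217;s a quick sketch (mine) putting those two formulas together, using the LLaMa-65B shapes from the paper (80 blocks, d_model of 8192, 32k vocab) and a 2048-token context:</p><pre><code class="language-python"># Parameter count: 4*d^2 for Q/K/V/output plus 8*d^2 for the MLP per block,
# plus the embedding matrix.
def n_params(n_blocks: int, d_model: int, n_vocab: int) -> int:
    return n_blocks * (4 * d_model**2 + 2 * 4 * d_model**2) + n_vocab * d_model

# KV cache: 2 (one K, one V) * d_model per block, for every token of context.
def kv_cache_bytes(n_blocks: int, d_model: int, n_tokens: int, n_bytes: float) -> float:
    return n_blocks * 2 * d_model * n_tokens * n_bytes

P = n_params(80, 8192, 32_000)                    # ~64.7B, i.e. LLaMa-65B
print(P * 0.5 / 1e9)                              # int4 weights: ~32GB
print(kv_cache_bytes(80, 8192, 2048, 0.5) / 1e9)  # int4 KV cache: ~1.3GB
</code></pre>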
<p>This is one of the key advantages of quantization. By using less precision, we can radically decrease the amount of memory needed to store our models. Note that, with int4 precision, <em>all of these models fit into memory on an A100</em> (which is the standard datacenter GPU right now), and all of them, except for the biggest model, fit into memory on high-end consumer GPUs (3090s/4090s, which have 24GB of RAM).</p><p>Now, when it comes to actually running inference, it takes approximately 2P FLOPS per token, because we are doing a bunch of matmuls with a total of P parameters, and multiplying a matrix of size (m, n) with a vector of size (n,) has a cost of 2mn.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>With all that math out of the way, let&#8217;s calculate the requirements for running inference with LLaMa. The main requirements when it comes to sampling are:</p><ol><li><p>Keep the KV cache in memory, in addition to all the parameters.</p></li><li><p>Read all the weights from HBM into the on-chip memory. Because we sample auto-regressively, we have to repeat this for each token we sample.</p></li><li><p>Do the actual matmuls to calculate the output of our network.</p></li></ol><p>The latency is the maximum of either the compute or the memory latency, as reading parameters into on-chip memory happens asynchronously in all modern tensor programming libraries.
As a result, we write:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{latency}_\\text{model} = \\text{max}(\\text{latency}_\\text{compute}, \\text{latency}_\\text{memory})&quot;,&quot;id&quot;:&quot;GMGYVPVKRU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{latency}_\\text{memory} := \\dfrac{P\\cdot n_{\\text{bytes}}}{n_{\\text{memory bandwidth}}}, \\text{latency}_\\text{compute} := \\dfrac{2 \\cdot P \\cdot B}{n_{\\text{flops}}}&quot;,&quot;id&quot;:&quot;XWELTOBZYJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where B is the batch size. On an A100 with fp16 weights, the memory bandwidth is ~1.935e12 bytes/s and the number of FLOPS is ~3.12e14, so as long as the batch size is less than 161, the model is memory-bound.</p><p>The same conclusion holds at lower precision with a batch size of 1, because on most hardware (e.g. Nvidia GPUs) the FLOPS scale linearly as you decrease the precision (you get twice the FLOPS when using fp16 vs fp32, which doubles again as you go to int8, and doubles once more as you go to int4), offsetting the smaller n_bytes in the memory term.</p><p>As LLaMa.cpp uses int4s, the RAM requirements for the 65B model are reduced to ~1.33GB for a 2048-token KV cache, and ~32.5GB for the model parameters. That&#8217;s pretty good!</p><p>As the memory bandwidth is almost always<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> much smaller than the number of FLOPS, memory bandwidth is the binding constraint.</p><p>Note that the FLOPs per token (2P) and the bytes to be read per token (P &#183; n_bytes) both scale with P: we have to 1) load all of the parameters into on-chip memory and then 2) use the parameters to compute the results. These happen simultaneously, as all modern tensor programming frameworks handle the &#8220;loading into memory&#8221; bit asynchronously, so the total time required is max(compute time, memory time).</p>
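<p>As a sanity check, here&#8217;s that arithmetic as a snippet (mine), for an A100 serving the 65B model in fp16 at batch size 1:</p><pre><code class="language-python">P, n_bytes, B = 65e9, 2, 1               # 65B params, fp16, batch size 1
bandwidth, flops = 1.935e12, 3.12e14     # A100-80GB: bytes/s and FLOPS

latency_memory = P * n_bytes / bandwidth # ~67 ms/token
latency_compute = 2 * P * B / flops      # ~0.4 ms/token: heavily memory-bound
print(1 / max(latency_memory, latency_compute))  # ~15 tokens/s ceiling
</code></pre><p>The ratio of the two latencies here is exactly the ~161 crossover batch size above: until the compute term catches up, extra batch elements are (roughly) free.</p>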
<h1>Running LLaMa on an A100</h1><figure><img src="https://substackcdn.com/image/fetch/$s_!gukl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F593cdd8c-7c00-406b-ba45-bfc94fb5da05_1574x322.png" alt="Estimated sampling performance for the LLaMa models on an A100"></figure><p>On an A100 (80GB PCIe), the memory bandwidth is 1935 GB/s. The int4 compute is 1248 TOPS. As such, the model is (heavily) memory-bound.
We should expect to see roughly 30 tokens/s with the 65B model, and 277 tokens/s with the 7B model.</p><h2>Running LLaMa on a MacBook</h2><p>The M1 GPU has a memory bandwidth of <a href="https://www.macworld.com/article/783678/m2-vs-m1-chip-performance-graphics-ram.html">68.25 GB/s</a>, and can do up to <a href="https://tlkh.dev/benchmarking-the-apple-m1-max#heading-gpu-matrix-multiplication-gemm-performance">5.5 TFLOPS</a> of fp16 compute. As such, we should expect a ceiling of ~1 token/s for sampling from the 65B model with int4s, and 10 tokens/s with the 7B model.</p><p>As the M2 Pro has 200 GB/s of bandwidth, and the M2 Max has 400 GB/s of bandwidth, we should expect massive improvements with them, going up to 6 tokens/s on the 65B model with the M2 Max. That&#8217;s pretty darn good for a laptop.</p><h2>Running LLaMa on a Raspberry Pi 4</h2><p>A Raspberry Pi 4 has <a href="https://web.eece.maine.edu/~vweaver/group/green_machines.html">13.5 GFLOPS of compute</a> and <a href="https://forums.raspberrypi.com/viewtopic.php?t=281183">~4GB/s of memory bandwidth</a>. Given this, we&#8217;d expect to see ~2 tokens/s with the 7B model if it were memory-bound. Given that we&#8217;re currently seeing ~0.1 tokens/s, I suspect we&#8217;re actually compute-bound (although this is a stab in the dark&#8212; I can&#8217;t find enough information about the specs for a Raspberry Pi to determine this with any precision).</p>
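<p>Running the same max() arithmetic with the (assumed) Pi specs above shows how close the two ceilings are on a Pi, which is part of why it&#8217;s hard to say which constraint binds:</p><pre><code class="language-python">P, n_bytes = 7e9, 0.5                    # 7B model, int4
flops, bandwidth = 13.5e9, 4e9           # assumed Raspberry Pi 4 specs

latency_memory = P * n_bytes / bandwidth # ~0.88 s/token
latency_compute = 2 * P / flops          # ~1.04 s/token
print(1 / max(latency_memory, latency_compute))  # ceiling of ~1 token/s; observed is ~0.1
</code></pre>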
<h2>Summary</h2><p>Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them <em>much</em> easier to serve&#8212; like quantization! This is yet another reason why distillation, or just <a href="https://finbarr.ca/llms-not-trained-enough/">training smaller models for longer</a>, is really important.</p><p><em>Note: I&#8217;m not an expert in CUDA, so I probably have errors in my math. If so, please let me know&#8212; I&#8217;ll update the post and credit you.</em></p><p>Resources on transformer inference performance:</p><ul><li><p><a href="https://lilianweng.github.io/posts/2023-01-10-inference-optimization/">Large Transformer Model Inference Optimization</a></p></li><li><p><a href="https://kipp.ly/blog/transformer-inference-arithmetic/">Transformer inference arithmetic</a></p></li><li><p><a href="https://kipp.ly/blog/transformer-param-count/">LLM parameter counting</a></p></li><li><p><a href="https://arxiv.org/abs/2009.06732">Efficient Transformers</a></p></li></ul><p><em>Thank you to <a href="https://twitter.com/kaushikpatnaik?lang=en">Kaushik Patnaik</a>, <a href="https://twitter.com/immortal_333">immortal_333</a>, and <a href="https://twitter.com/arthurallshire">Arthur Allshire</a> for reading &amp; commenting on early drafts of this, and <a href="https://substack.com/@salimf">Salim Fakhohuri</a> + <a href="https://twitter.com/ShumingHu">Shuming Hu</a> for pointing out errors in my math.</em></p><p><em>Errors that have been corrected from earlier versions:</em></p><ol><li><p><em>I was missing the batch term in the latency_compute equation.</em></p></li><li><p><em>I had an extra factor of 2 in the latency_memory equation.</em></p></li></ol><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I learned almost all of the math surrounding transformer performance from their article; they deserve full credit.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Although we obviously don&#8217;t need to calculate the number of parameters for the LLaMa models, as we know them. The equation is useful as a sanity check.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For a more detailed discussion showing that this is the case, check out <a href="https://kipp.ly/blog/transformer-inference-arithmetic/#flops-counting">kipply&#8217;s article</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I hedge with &#8220;almost&#8221; here, but I&#8217;m not aware of any counterexamples.</p></div></div>]]></content:encoded></item><item><title><![CDATA[A step towards self-improving LLMs]]></title><description><![CDATA[Here, I outline a research agenda towards making LLMs self-improve, a key problem standing in the way between current technology and AGI.]]></description><link>https://www.artfintel.com/p/a-step-towards-self-improving-llms</link><guid isPermaLink="false">https://www.artfintel.com/p/a-step-towards-self-improving-llms</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Tue, 07 Mar 2023 16:32:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T1G_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bd34bc-e24c-41a9-a451-279207b7538d_2014x1284.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here, I outline a research agenda towards making LLMs self-improve, a key problem standing between current technology and AGI.</p><p>If I look at GPTs/LLMs, three of the biggest problems I see with existing techniques are:</p><ol><li><p>We need our models to be able to generate data by themselves, i.e. we need a <a href="https://www.lesswrong.com/tag/recursive-self-improvement">recursive self-improvement loop</a>. AlphaZero is the shining example of what&#8217;s possible here.</p></li><li><p>We need our models to be able to operate in new domains without requiring massive amounts of existing data. CLIP provides an option here, as does Internet Explorer (the paper, not the browser).</p></li><li><p>Autoregressive sampling. It&#8217;s <a href="https://twitter.com/blennon_/status/1631726432471887872?s=20">slow, and suboptimal</a>.</p></li></ol><p>I have better ideas for how to tackle #1, so I&#8217;ll focus on that.
#2 &amp; #3 will come later.</p><p>There are other issues facing LLMs, such as:</p><ol><li><p>Increasing the length of the context window</p></li><li><p>Figuring out how to train larger models</p></li><li><p>Figuring out how to train more efficient models (less parameters, less data, less energy)</p></li><li><p>Factual accuracy</p></li><li><p>Mitigating attacks that convince LLMs to exhibit harmful behaviour (&#8221;red-teaming&#8221;), e.g. prompt injection</p></li></ol><p>I think these are fundamentally engineering problems that we&#8217;ll be able to figure out iteratively. For instance, context length has seen a lot of progress with <a href="https://openreview.net/forum?id=H4DqfPSibmx">subtle algorithmic</a> <a href="https://arxiv.org/abs/2112.05682">improvements</a>; if we combine those changes with the many <a href="https://twitter.com/karpathy/status/1621578354024677377">arcane engineering optimizations</a> that are out there, I think we&#8217;ll get to a point where context goes to 64k tokens or more, at which point we&#8217;ll be deep in the <a href="http://finbarr.ca/the-sigmoid/">saturating point of the sigmoid</a>. Or for factual accuracy- I think that retrieval will largely solve that once it&#8217;s incorporated into most models.</p><p>However, I&#8217;m probably wrong, and could very well end up writing a version of this post in 2034 talking about how the biggest problem facing AGI is prompt injections.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.artfintel.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.artfintel.com/subscribe?"><span>Subscribe now</span></a></p><h2>A path towards recursive self-improvement</h2><p>GPTs work very well in one specific context: they are very, very good at finding text that is likely to follow other text in a way that appears natural to humans.</p><p>What they don&#8217;t do is come up with text that they haven&#8217;t seen before. Kinda. What they&#8217;re doing when we sample from them now is predict what they&#8217;ve seen during training. Sometimes these predictions produce text that hasn&#8217;t been written before (this can occur often, due to the combinatorial nature of token sampling). When this happens, it&#8217;s a happy accident. The model isn&#8217;t trying to select text that is novel or that accomplishes any goal other than <strong>following the preceding 2048 tokens</strong> (or whatever the context length is).</p><p>The obvious exception is when models are finetuned using <a href="https://arxiv.org/abs/1706.03741">RLHF</a>. In RLHF, the models are explicitly trained to optimize a reward signal. In RLHF, the reward signal comes from a model trained to predict human feedback. 
Basically, humans are asked to choose between two samples of text, and then a model learns to predict which one is preferred.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!T1G_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bd34bc-e24c-41a9-a451-279207b7538d_2014x1284.png" alt="Diagram of the RLHF training loop"></figure>
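<p>For concreteness, here&#8217;s a minimal sketch (mine) of the pairwise loss typically used to train such a reward model: the model scores both samples, and we maximize the log-probability that the human-preferred one scores higher.</p><pre><code class="language-python">import torch

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected); minimize the
    # negative log-likelihood of the human labels.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores from a hypothetical reward model, for a batch of two comparisons.
loss = preference_loss(torch.tensor([1.2, 0.4]), torch.tensor([0.3, 0.9]))
</code></pre>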
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Why does this matter? Predicting the next token works pretty well! And <a href="https://twitter.com/sama/status/1599471830255177728?s=20">maybe we&#8217;re all</a> just <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a>? It matters because the biggest impediment to improving our models right now is the <strong>lack of data</strong>. The scaling law papers (<a href="https://arxiv.org/abs/2203.15556">Chinchilla</a>, <a href="https://arxiv.org/abs/2001.08361">OpenAI</a>) consistently point to the fact that we need to scale up the datasets we train LLMs on.</p><p>For instance, Chinchilla predicts that we&#8217;ll need 11 <strong>trillion</strong> tokens to optimally train a model the size of PaLM (i.e. 540B parameters). 
If we want to push past PaLM to a model with 1 trillion parameters, we&#8217;ll need 20T tokens!</p><figure><img src="https://substackcdn.com/image/fetch/$s_!qdlk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b495376-c776-4c08-a0ff-d5089af40086_1282x606.png" alt="Chinchilla&#8217;s projected compute-optimal model and dataset sizes"></figure>
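<p>Those numbers come straight out of Chinchilla&#8217;s fit, which works out to roughly 20 training tokens per parameter (a commonly cited approximation, which I&#8217;m using here for illustration):</p><pre><code class="language-python"># ~20 tokens per parameter: the rule of thumb implied by Chinchilla's fit.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return tokens_per_param * n_params

print(chinchilla_optimal_tokens(540e9) / 1e12)  # PaLM-sized: ~10.8T tokens
print(chinchilla_optimal_tokens(1e12) / 1e12)   # 1T params: 20T tokens
</code></pre>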
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That&#8217;s a lot of data. That&#8217;s so much data that it&#8217;s not clear that we can get that from existing sources. <a href="https://twitter.com/nostalgebraist?lang=en">nostalgebraist</a> <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications#fnjefvfidovdb">argues</a> that 1) we&#8217;ve basically exhausted the available data in structured domains like coding and 2) it&#8217;s starting to look like we&#8217;re running out of general-domain data. I find nostalgebraist compelling; the only counterargument I could see is that private data sources might be a rich vein of tokens, but I don&#8217;t see a clear path to getting access to them.</p><p>This lack of data is unfortunate because, according to <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications#fnjefvfidovdb">Chinchilla&#8217;s scaling laws</a>, we could see another ~8% reduction in training loss (1.93 &#8594; 1.77, delta of 0.16 in loss) for if we had infinite data <em><strong>while changing nothing else about Chinchilla</strong></em>. That&#8217;s a pretty substantial improvement when you consider that the improvement from Gopher to Chinchilla was only 2.9% (1.99 &#8594; 1.93, delta of 0.06 in loss), not to mention the fact that our models are already quite good&#8212; able to <a href="https://www.bbc.com/news/technology-62275326">trick Google SWEs into believing they&#8217;re sentient</a>, and <a href="https://www.lesswrong.com/posts/FKNtgZrGYwgsz3nHT/bankless-podcast-159-we-re-all-gonna-die-with-eliezer">scaring the Yud</a>. </p><h3>More data</h3><p>The clear implication is that we need way more data! Our models are desperate for data. They&#8217;re lying on the beach&nbsp;<em>gasping for more data</em> to quench their ever-growing thirst.</p><p>But where will the data come from?</p><p>If we can scrape it we should. It&#8217;s not clear how much there is left to scrape. Especially at the largest research institutions like OpenAI, Google Brain, and DeepMind, I&#8217;m certain that they have teams of engineers working on scraping all possible data. There is some possibility to automate this process; the excellently named <a href="https://arxiv.org/abs/2302.14051">Internet explorer paper</a> presented a model which crawls the web to get additional data to augment it&#8217;s dataset. 
<p>Although letting a nascent AI loose on the internet would make Eliezer cry, it could be an excellent source of data, especially if one incorporates some sort of reinforcement-learning-style feedback loop to continually improve the way the model searches the web.</p><p>The data problem is compounded by the fact that high-quality data <strong>really matters</strong>. Experiments consistently show that deduplicating data increases performance substantially (<a href="https://arxiv.org/abs/2205.10487">https://arxiv.org/abs/2205.10487</a>, <a href="https://arxiv.org/abs/2107.06499">https://arxiv.org/abs/2107.06499</a>). Basically, I&#8217;m not convinced there is a lot more high-quality data out there. Two exceptions might be commercial data (e.g. internal corporate documents) and copyrighted text, but it would be extremely difficult to get access to either corpus.</p><h3>Generate data</h3><p>The solution seems self-evident to me: we should generate our own data. There has been some work on training LLMs on data they have generated (<a href="https://arxiv.org/abs/2210.11610">https://arxiv.org/abs/2210.11610</a>, <a href="https://arxiv.org/abs/2212.08073">https://arxiv.org/abs/2212.08073</a>), and a few different techniques seem promising here.</p><p>In <a href="https://arxiv.org/abs/2210.11610">Huang et al.</a>, the authors use Chain of Thought (CoT) reasoning to generate additional data. Given a dataset of questions, they sample N CoT answers per question, prompt with &#8220;The answer is &#8220; to extract each final answer, and then keep every generated rationale that arrives at the majority answer, using those texts as additional training data. In practice, there&#8217;s no reason to restrict this to question answering, although that is a particularly straightforward problem to apply it to; one could imagine, say, embedding all of the generated answers, clustering them, and keeping the answers in the biggest cluster, or employing RLAIF as Anthropic did in the <a href="https://arxiv.org/abs/2212.08073">Constitutional AI</a> paper to select which answers to keep.</p><p>Another option is to use RLAIF, from that same Constitutional AI paper, to generate data directly: the model critiques and revises its own outputs according to a written set of principles, and the revised outputs (along with AI-generated preference labels) become new training data.</p><h2>Throw compute at the problem</h2><p>Yet another line of research involves throwing compute at the problem. We know that we can use a variety of techniques to soak up compute and improve outcomes. For instance, <a href="https://en.wikipedia.org/wiki/Ensemble_learning">ensembling</a> is a classic ML technique that reliably improves model performance; but given that frontier LLMs already sit at the limit of what we can afford to train and serve, naively ensembling several of them is almost certainly off the table.</p><p>However, what we can do is use compute to apply search on top of our existing model&#8217;s outputs. If we can find a <a href="https://proceedings.neurips.cc//paper/2020/file/22eda830d1051274a2581d6466c06e6c-Paper.pdf">policy improvement operator</a>, i.e. a function T that takes an existing distribution over tokens, &#960;, and returns a new distribution, T(&#960;), which improves our loss, then we can use T to improve our model.</p>
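<p>To state that more precisely (my notation, not the original paper&#8217;s): T is a policy improvement operator with respect to an objective R if, for every policy &#960;, E<sub>y~T(&#960;)</sub>[R(y)] &#8805; E<sub>y~&#960;</sub>[R(y)]. Everything that follows is a candidate T: each spends extra inference compute to turn samples from &#960; into better samples.</p>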
<p>Some candidates:</p><ul><li><p>Best-of-n</p></li><li><p>Beam search</p></li><li><p>Policy-driven search</p></li></ul><p>&#8220;Best-of-n&#8221; (<a href="https://openai.com/research/measuring-goodharts-law">https://openai.com/research/measuring-goodharts-law</a>) is a technique similar to ensembling in which we sample from our model N times and use the sample with the highest score according to our objective function. It performs remarkably well (outperforming the RLHF model in the <a href="https://openai.com/research/webgpt">WebGPT paper</a>), is simple to implement, trivial to analyze mathematically, and trivially parallelizable, but it makes inference N times more expensive. If I were OpenAI, I&#8217;d be caching the results of queries to the models and using best-of-n for repeated queries.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3ita!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bfca1e7-3cf5-4259-b3c8-3f4896815623_1468x964.png" width="1456" height="956" alt=""></figure></div>
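<p>Best-of-n is simple enough to fit in a few lines. A minimal sketch, assuming we have a <code>sample</code> function that draws one completion from the model and a <code>score</code> function standing in for the reward model (both are stubs of my own invention):</p><pre><code>import random

def best_of_n(prompt, sample, score, n=16):
    """Draw n independent samples and keep the one the scorer likes best.

    The n calls to sample() are independent, so this parallelizes
    trivially; the cost is n model forward passes instead of one.
    """
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))

# Toy usage, with stubs standing in for the model and the reward model.
sample = lambda prompt: prompt + random.choice([" 4", " 5", " four"])
score = lambda prompt, y: 1.0 if y.endswith("4") else 0.0
print(best_of_n("2+2=", sample, score, n=8))</code></pre>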
<p>In the WebGPT paper, the authors found that best-of-16 resulted in an improvement in human preference rates of 5% (60% &#8594; 65%), while going from 13B parameters to 175B parameters resulted in an improvement of 10% (~47% &#8594; 57%).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!b76V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8366826-9e30-4952-a354-ea649c9745f2_1438x1038.png" width="1438" height="1038" alt=""></figure></div>
<p>As both charts show roughly linear improvements in performance, and both models increase in cost roughly linearly, this seems to imply that a best-of-64 13B model would beat a best-of-4 175B model at roughly the same compute cost (64 &#215; 13B &#8776; 832B parameter-forward-passes per token, versus 4 &#215; 175B = 700B). Given that the 13B model fits on a single GPU, this would substantially lower the overall compute needs of the system.</p><p>Another improvement operator is an NLP classic: <a href="https://en.wikipedia.org/wiki/Beam_search">beam search!</a> In beam search, one performs a breadth-first search over the model&#8217;s outputs, with finite depth and width (e.g. keeping only N successors at each level of the tree, and searching to a depth of M levels), with the final result being the sequence with the maximum objective score (typically log-likelihood). While a number of LLMs do use beam search,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> they don&#8217;t appear to report performance numbers, so I&#8217;m unable to include a comparison of how much it matters.</p><p>A concern is that beam search lowers diversity, as it concentrates probability mass on near-identical sequences; this is especially problematic for byte-level tokenizers, like BPE, as the individual tokens might vary significantly. <a href="https://twitter.com/sedielem">Sander Dieleman</a> <a href="https://sander.ai/2020/09/01/typicality.html">wrote about how</a> strategies like beam search are &#8220;the culprit behind many of the pathologies that neural machine translation systems exhibit&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The final candidate (or family of candidates) for the improvement operator is an option that I find very exciting: learning an algorithm to search the token tree. The idea is that we could do something like AlphaZero, learning a policy plus a value function.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> This would also allow us to change the reward function if we wanted to, rather than just using the standard log-likelihood; we could, for instance, train the reward function directly on human data. If you&#8217;re serving millions of users per day, you could run RL directly on that feedback, which is the position the myriad chat bots on the market today (Bing, ChatGPT, Claude, etc.) are in.</p><h2>Next steps</h2><p>Now, I am not employed by a lab studying AGI, so I do not have the <s>resources</s> GPUs to apply any of these strategies. If you&#8217;re inspired by any of these ideas and want to implement them, please do so; I&#8217;d love to hear from you.</p><p>I&#8217;d <em>particularly</em> love to hear from you if you disagree with me. 
What am I wrong about?</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Results from eyeballing the graph; not precise.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Examples include GPT-{2,3} (during decoding for text generation), BERT (for language understanding tasks), T5, and XLNet.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Sander&#8217;s post is great. I was struggling to understand why beam search isn&#8217;t used more in practice, and his post did a great job of explaining why.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Perhaps using MuZero, with a smaller model as the recurrent function, to save on compute.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Papers I've read this week]]></title><description><![CDATA[Hey folks!]]></description><link>https://www.artfintel.com/p/papers-ive-read-this-week</link><guid isPermaLink="false">https://www.artfintel.com/p/papers-ive-read-this-week</guid><dc:creator><![CDATA[Finbarr Timbers]]></dc:creator><pubDate>Sun, 05 Mar 2023 16:26:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qKIr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86077ced-ed60-4102-81e9-2a0bf789b665_504x502.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey folks! Welcome to my newsletter. I&#8217;ve got an article that I&#8217;m hoping to drop early in the week about a potential solution to one of the biggest problems I see with LLMs: the lack of data. Until then, here&#8217;s a post about some of the most interesting papers I&#8217;ve read in the last week.</p><p>I&#8217;m going to try to write a weekly summary of the most interesting papers I read each week. I&#8217;d love to hear what papers you&#8217;ve been reading, whether you agree or disagree with my conclusions on each paper, and/or suggestions for what I should read next!</p>
<h2>Scaling laws for routed language models</h2><p><a href="https://arxiv.org/abs/2202.01169">Abstract</a></p><p>I <a href="https://www.deepmind.com/">used to work</a> with Aidan, and way back in 2021, he was insistent that LLMs were the future of AI. I thought he was crazy. In &#8216;22, he was just as insistent that conditional routing models were the future. Given how right he was about LLMs, it&#8217;s probably worth paying attention to conditional routing models.</p><p>The paper provides a great general overview of how routing networks work, along with performance comparisons (in terms of negative log likelihood over a validation dataset) for the 3 most common routing techniques (<a href="https://arxiv.org/abs/1701.06538">sparse MoE</a>, <a href="https://arxiv.org/abs/2106.04426">non-parametric hash layers</a>, <a href="https://arxiv.org/abs/1308.3432">RL routing</a>).</p><p>The authors trained a large number of conditional routing networks and fit scaling laws to the results; they find that all 3 techniques follow the same scaling laws, with RL routing doing quite well. I&#8217;d be curious to see how much effort has been put into improving RL routing; I suspect it could be improved significantly.</p><p>The authors observed the following results:</p><ol><li><p>Routing improves the performance of language models across all sizes and variants attempted.</p></li><li><p>Training a routing network with RL is of comparable effectiveness to state-of-the-art techniques.</p></li><li><p>The performance of all routing networks is accurately described by scaling laws in the number of experts and in the underlying dense model size.</p></li></ol><p>I was surprised at how similar the performance of the various techniques was. The data was clean, with little variation, and the scaling laws fit it nicely.</p><p>One interesting result was that routing helps significantly more when the model is smaller: it improved performance across all model sizes they tried, but the benefit shrank as the models grew. I found this surprising; my intuition was that routing should help just as much at every scale.</p><p>The paper ends with recommendations, which I found really useful:</p><ol><li><p>Use routing for models with fewer than 1.3B parameters.</p></li><li><p>S-Base is a good default routing algorithm (defined in the appendix of the paper).</p></li><li><p>Target E in {64, 128} experts.</p></li><li><p>Use K = 1 experts; route layers with a frequency between 0.5 and 1 (lower frequencies reduce performance).</p></li></ol><h2>Internet Explorer</h2><p><a href="https://arxiv.org/abs/2302.14051">Abstract</a></p><p>In this paper, the authors create an agent which dynamically explores the internet, running text queries to find images to use for self-supervised training. While seemingly designed to <a href="https://rationalwiki.org/wiki/AI-box_experiment">directly antagonize Yudkowsky</a>, the paper is extremely interesting and presents, to me, a potential future direction for AGI research. 
As <a href="https://arxiv.org/abs/2203.15556">Chinchilla</a> showed us, LLMs could <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications#fnjefvfidovdb">massively improve</a> with more data. Having agents dynamically explore the internet is one excellent way to get more data, especially if they&#8217;re able to adaptively learn over time and prioritize images accordingly.</p><p>In the paper, they train a model to learn representations of images based on <a href="https://paperswithcode.com/method/moco-v3">MoCo-v3</a>. They query Google Images for new images, ranking the query results by similarity to the target dataset and assigning a reward to the new images:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!eRpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d027f0-587c-4340-bda9-517baf1e35d3_592x116.png" width="592" height="116" alt=""></figure></div>
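<p>In plain notation, the reward is (roughly; my reconstruction, using the paper&#8217;s symbols) the average cosine similarity between the candidate image&#8217;s representation and its k nearest neighbours in the target dataset: r(y) = (1/k) &#183; &#8721;<sub>x &#8712; N_k(y, D)</sub> S_cos(f_k(y), f_k(x)).</p>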
<p>Here, S_cos is the cosine similarity, f_k is the image encoder, D := {x_i} is the target dataset, and y is the new image to evaluate; the reward is computed over the k closest neighbours of y in the target dataset (where &#8220;closest&#8221; is measured in the encoded representation space).</p><p>They create the queries for Google Images by sampling from a static vocabulary dataset. For the sampling process, they estimate the reward associated with each query using Gaussian process regression, and choose the queries with the highest estimated reward.</p><p>I&#8217;d be really interested to see a fancier query generation process. One idea that comes to mind would be using RL to train an LLM to generate queries, in a manner similar to what&#8217;s done in <a href="https://openai.com/research/learning-from-human-preferences">RLHF</a>/<a href="https://arxiv.org/abs/2204.05862">RLAIF</a>, i.e. using an RL algorithm like PPO to finetune a pretrained LLM to maximize the reward. This would require much more compute, though.</p><h2>LLaMa</h2><p><a href="https://arxiv.org/abs/2302.13971">Abstract</a></p><p>I&#8217;ve been digesting the LLaMa paper that Facebook released this week. It was very interesting to see the performance increases they got despite the size decreases. 
Their 13B model outperformed GPT-3 on a number of benchmarks, and their 65B model was competitive with Chinchilla-70B and PaLM-540B (!).</p><p>I did find it incredibly frustrating that they stopped training when they did; their loss curves all look pretty far from convergence, and I&#8217;m curious how much the models would have continued to improve:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qKIr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86077ced-ed60-4102-81e9-2a0bf789b665_504x502.png" width="504" height="502" alt=""></figure></div>
<p>I wish that they had just <a href="https://karpathy.github.io/2019/04/25/recipe/#6-squeeze-out-the-juice">left it training</a>.</p><p>My biggest issue with the paper is that it&#8217;s not clear what caused the improvements. They discuss a few major changes compared to GPT-3, which their model is based on:</p><ul><li><p>They only use publicly available data, but it&#8217;s unclear exactly what the filtering steps are. I wish they&#8217;d open source their dataset (or, at least, the code to clean it).</p></li><li><p>They normalize the input of each transformer sub-layer, rather than the output (i.e. pre-normalization).</p></li><li><p>They use the <a href="https://paperswithcode.com/method/swiglu">SwiGLU</a> activation function, as PaLM did, with a slight dimensional difference compared to PaLM.</p></li><li><p>They use Rotary Embeddings for the positional encoding.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li></ul><p>There was no ablation study, unfortunately. If I can scrounge up the GPUs, I&#8217;m tempted to run my own ablations based on <a href="https://github.com/karpathy/nanoGPT">nanoGPT</a>. They also use <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>, which I suspect will become the default attention implementation used in LLMs going forward.</p><p>And that&#8217;s it! Thanks for reading. If you have thoughts on any of these, or interesting follow-up papers, I&#8217;d love to hear them.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I haven&#8217;t seen a great paper comparing the various positional encoding schemes. I don&#8217;t really understand which are better, whether any advantage is general, or whether performance varies across scenarios. A proper positional encoding ablation study is on my list of experiments to do once I can scrounge up GPUs.</p></div></div>]]></content:encoded></item></channel></rss>