Papers I’ve read this week, Mixture of Experts edition

I read a bunch of papers about conditional routing models

Aug 04, 2023

Papers I’ve read this week, Mixture of Experts edition

Mixture of Experts (MoE) models have been getting a lot of attention lately, what with the all the rumours about OpenAI using them in GPT-4. I’ve been reading a lot of the foundational papers about MoE models, and I’ve taken detailed notes, which I wanted to share. This is a bit of a long one, so you might want to read this on the web.

Background

A standard deep learning model uses the same parameters for each input. For instance, when using a ResNet to classify images, the same parameters are used for every image. With conditional routing models, such as MoEs, the parameters each input sees vary depending on the input. For MoE models, in particular, they select different combinations of parameters for each input. We use the term “dense” to refer to plain old vanilla models which do not vary the parameters.

The name can be somewhat misleading. The MoE technique, at least as it is commonly used, is not an ensembling technique that is used to combine models trained independently. Rather, it is a clever way to distribute a large transformer over multiple GPUs. It is more analogous to a different type of model parallelism, such as the Megatron architecture. Each “expert” can be anything from an entirely separate transformer to a different constituent block of the transformer. Within the network, you have “router” layers which are typically much smaller than the rest of the network, and which decide how to allocate tokens across experts. The exact way in which this is done varies significantly by method.

When using conditional routing models, several problems crop up:

Allocating too many tokens to the top models. When learning to route models to experts, a “winners get bigger” effect exists, in which a positive feedback loop can cause the most successful model to see all of the tokens, causing that model to get better, and thus to have more tokens routed to it. This is problematic, as in the limit, this will lead to the model becoming a standard dense transformer.
Poor sharding performance. Even if the tokens are being allocated to all models in expectation, for efficient accelerator utilization we actually want each batch to be evenly allocated across experts. This can be hard to achieve.
Evaluating performance vs dense models. It can often be unclear how to compare the models. For instance, there were several labs which trained MoE models with >1T parameters. This was compared directly to GPT-3, which “only” had 175B parameters. The comparison is, of course, invalid, as only a fraction of these parameters are used for each request.

The papers I discuss below attempt to solve these issues.

One paper I will not be discussing is an excellent survey paper by Clark et al, which is an overview of how conditional routing models work, and which does a number of strong experiments to help understand their performance. I wrote about it in an earlier article, and will not repeat myself here.

Outrageously large networks

Abstract

The OLNN paper proposed a Sparsely-Gated Mixture-of-Experts Layer, which was one of the first MoE implementations that was practical to use at scale, and was performant.

It was by Noam Shazeer, as seemingly all papers advancing the state of the art in NLP are, and in it, the authors apply a MoE layer convolutionally between stacked LSTM layers. Their proposed Mixture of Expert layers consist of a set of N expert networks (E(x)), and a gating network, G, that outputs a sparse, n-dimensional vector. The gating network decides how to combine the networks:

\(y = \sum \limits_{i=1}^n G(x)_i E_i(x)\)

If G(x)_i = 0, we can skip computing E_i(x), which can be expensive (it can be, for instance, a decoder block from a transformer). You can also stack gating networks to create a hierarchy of networks. A common choice for MoE models is for G(x) to be a linear layer with a softmax activation:

\(G(x) = \text{softmax}(\textbf{x}W^T)\)

Here, the authors add sparsity, and random, learned, Gaussian noise:

\(G(x) = \text{softmax}(\text{topK}(H(\textbf{x}), k))\)

topK(x, k) is the identity function for the top K values, and sets the rest to negative infinity, and H(x) is a trainable noise function:

\(H(x)_i := (W^T_g\textbf{x})i + \epsilon \cdot \text{softplus}(W^T_\text{noise} \textbf{x})_i, \epsilon \thicksim N(0, 1)\)

This complicated function allows the gating network to be trained using backprop in the standard way. It also has the nice property that the model scales proportional to k*b*d/n, where we choose k out of n experts, a batch of size b, and distribute the model over d devices. As such, if we maintain the k/n ratio static, then we can continue to scale the number of experts (and thus the total model size) as long as we have given Jensen Huang enough money to have the GPUs we need.

Interestingly, the model was written when LSTMs were the SOTA in language modelling, but the proposed architecture seems like a remarkably strong fit for current architectures, as it makes it easy to shard the experts over a number of GPUs. The paper uses a load balancing loss to try and balance expert utilization. The load balancing loss is an area of active research, and is one that will be focused on in subsequent papers. The loss used here is:

\(L_{\text{importance}} = c_{\text{importance}} \cdot \text{CV}(\text{Importance}(X))^2 \)

Where the importance is defined as the total weight that all the experts place on the batch X:

\(\text{Importance}(X) := \sum \limits_{x \in X} G(x)\)

CV here is the coefficient of variation, i.e. the ratio of the standard deviation to the mean:

\(\text{CV}(x) := \frac{ \sqrt{\text{Var}(x)}}{ E[x]} \equiv \frac \sigma \mu\)

This loss will encourage equal importance, as the easiest way to minimize the loss is to minimize the variance of the importances. The authors also use an additional penalty term to further encourage load-balancing:

\(L_{\text{load}} = c_{\text{load}} \cdot \text{CV}(\text{Load}(X))^2\)

The load is defined similarly to the Importance, but they define a new function, P(x, i), as the probability that the i-th entry of G(x) is nonzero when you add noise to it:

\(\text{Load}(X)i = \sum_{x \in X} P(x, i)\)

They have a somewhat complicated way to actually implement P(x, i), and it’s worth checking out the appendix for the details. This results in their load balancing loss becoming

\(L(X) = L_{\text{load}}(X) + L_{\text{importance}}(X) \)

The authors also explored yet another loss in which they forced the model to strictly balance tokens across experts. It’s interesting to see how much time was dedicated to the load balancing; this is indicative of how critical this is to get the full benefit of MoE models.

The authors found a significant improvement in modelling performance when using their MoE models, seeing significant improvements when comparing models with approximately 8M FLOPS per timestep:

In the era of LLMs, it’s somewhat funny to look at the number of model parameters here; their largest model has 4B parameters, which would be considered quite small today.

Switch Transformers

Abstract

Another Shazeer paper, this was inspired by Kaplan et. al, which found that GPT model follow power-laws in scaling the model size, dataset size, and computational budget. They add add a fourth dimension: the parameter count, keeping the FLOPs per token constant. Like the OLNN paper, by keeping unique experts on different devices, the total parameter count for the model increases proportional to the number of devices, allowing for embarrassingly parallel operations.

The paper is a minor modification on the OLNN paper. They make two changes:

The authors use k = 1 (which they call a Switch layer), whereas OLNN routing used k > 1. They show this performs better.
They use an auxiliary load balancing loss that was the same as Shazeer (2018) and Lepihkin et al (2020), and is much simpler than the overly complicated loss(es) used in the OLNN paper:
\(\mathcal{L}_{\text{load balancing}}(x) := \alpha N \sum \limits_{i=1}^N f_i P_i,\)

\(f_i := \frac 1 T \sum \limits_{x \in \mathcal{B}} \mathbb{1}\{\text{argmax } p(x) = i\}\)

\(P_i := \frac 1 T \sum \limits_{x \in B} p_i(x) \)

P_i is the fraction of tokens allocated to expert i across all tokens in the batch, while p_i(x) is the probability of allocating token x to expert i. As we want to spread tokens evenly across the N experts, we want both the f_i and the P_i terms to be equal to 1/N, for all i, which is encouraged through this, as that is the optimal value for the P-vector to take, which is the only differentiable part of this equation. It is multiplied by N to keep the entire term constant (under uniform routing, the sum is equal to N * (1/N^2) = 1/N).

They find that these small changes are enough to perform better than both MoE and Dense transformers with equivalent runtime performance. The paper also asks a Chinchilla-esque question:

For a fixed training duration and computational budget, should one train a dense or a sparse model?

They find a significant advantage to training a sparse transformer vs a dense one, given a fixed computation budget. This matches my intuition; with much more parameters, and an efficient way of selecting between them, the model should have much higher performance.

As an applied practitioner, I think looking at inference cost is the better benchmark to use, which is even more in favour of sparse transformers. Once a MoE model has been trained with some number N > K experts of size P, where each inference involves routing tokens to the top K experts, it has the same performance characteristics as a dense model with KP parameters. As typically the inference costs will dramatically outweigh the training costs, this becomes the relevant benchmark.

Put differently, if we’re deploying thousands of GPUs to serve inference requests, we don’t have preferences over deploying them to serve a larger number of copies of the same dense transformer, or using them to serve a smaller number of MoE transformers: we’re computationally agnostic (assuming that we’re able to achieve uniform routing).

Two papers that the authors reference repeatedly throughout the Switch Transformers paper are the GShard and Mesh-Tensorflow papers, which go into depth about the computational details of implementing these models. They’re worth a read if that’s of interest to you.

ST-MoE

Abstract

I’m apparently a Shazeer fanboy today- a third paper1 of his in this line was the Stable and Transferable Sparse Expert Model paper, which built on the Switch Transformer paper to introduce the ST-MoE transformer. The paper focuses on the training instabilities that come up in training sparse expert models.

The paper focuses on identifying approaches that can improve training stability for sparse models. Training instability here means training runs which diverge:

The authors explore a bunch of methods to stability models, but note that many of them degrade model quality, which is undesirable. They also note a number of changes which worsen stability, but increase quality (e.g. modifications with more multiplicative components, like GEGLU or RMSNorm). They propose a new loss, the router z-loss, which doesn’t degrade quality:

\(L_z(x) := \frac 1 B \sum \limits_{i=1}^B \left( \log \sum \limits_{j=i}^N e^{x_j^{(i)}}\right)^2\)

where B is the number of tokens, N is the number of experts, and x are the logits going into the router. This is a penalty on large logits into the gating network. As such, the overall loss for the model becomes (yet another complicated, multi-part loss function!):

\(L_{\text{total}} = L_{\text{cross-entropy}} + c_{\text{load-balancing}} L_{\text{load-balancing}} + c_zL_z\)

The idea is that modern transformers are trained with mixed precision, typically using bfloat16 for all matmuls other than gradient updates, which are stored in float32. As such, larger logits involve larger roundoff errors (table from the ST-MoE paper):

The routing layers in MoE models introduce additional exponential functions through the softmax function, and exponential functions, as they increase the scale of the values being considered, can lead to higher roundoff errors. This is particularly problematic when routing with k > 1 experts.

The authors run a number of experiments comparing top-K routing, and find that top-2 tends to be the sweet spot in terms of complexity and compute. While you can increase quality by increasing the size of K, it comes at a linear increase in compute cost.

The authors also perform a qualitative analysis of what the experts learn; they find that various experts specialize in punctuation, visual descriptions, proper names, and counting/numbers. This is interesting as that isn’t forced by any of the specific losses, but rather emerges from the model, as one might naively suspect. However, they don’t observe that experts specialize in languages when trained on multilingual datasets, which I found surprising. This sort of qualitative analysis is excellent and often missing. I wish more papers had this!

With this paper, we start to see the load-balancing losses simplifying and converging on more interpretable, manageable, functional forms, when compared to the more complicated setups from the earlier papers, such as OLNN.

Interestingly, the last author on this paper, William Fedus, is currently at OpenAI. He joined Jack Rae and Aidan Clark there, two of DeepMind’s authors from the “Scaling laws for routed language models” paper (although Jack Rae recently returned to DeepMind). Similarly, all of the authors on the Switch Transformers paper (William Fedus, Barret Zoph, and Noam Shazeer) have left Google, with Zoph and Fedus going to OpenAI. OpenAI seems to have accumulated quite a stable of sparse transformer experts! (I’m curious how they route work between themselves. 🥁💥)

Aug 4, 2023

I can't wait to see how these advancements will shape the future of AI. Keep up the fantastic work!

Expand full comment

Nathan Lambert

Great article. Some of these intuitions, such as why MoE can fail in the beginning and how the MoE can be equivalent to a dense model with some many params are super useful.

Any insight into methods that could be solved to make the routing problem better? Additionally, as OAI is not really a consumer company, I doubt their decision was made based on a lower inference cost (they just want the best models to find AGI-like things). Thoughts on what that means?

1 reply by Finbarr Timbers

1 more comment...

Artificial Fintelligence

Discussion about this post