Apr 15, 2023

> In particular, it’s not clear to me why we need to go from CLIP to CLIP

I also found this confusing when I read the DALLE2 paper (see https://www.lesswrong.com/posts/XCtFBWoMeFwG8myYh/dalle2-comments).

Not long after that paper, Google came out with Imagen (https://arxiv.org/abs/2205.11487), which is reportedly better than DALLE2 (in a head-to-head comparison) despite using a much more obvious approach:

- conditioning with cross-attention to a text encoder, as in GLIDE

- but, using a powerful pretrained text encoder rather than training one end-to-end from scratch

The Imagen paper focuses on the case where the text encoder is T5, but they also tried using CLIP's text encoder, and got very similar (if slightly worse) results.
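For concreteness, here is a minimal sketch of what this kind of conditioning looks like (PyTorch, with illustrative module names and dimensions, not Imagen's actual code): the diffusion model's image tokens cross-attend to token embeddings from a frozen, pretrained text encoder, rather than to a text transformer trained jointly with the rest of the model.

```python
# Sketch only: GLIDE/Imagen-style text conditioning with a frozen text encoder.
# Dimensions (320 for image features, 768 for text embeddings, 77 tokens) are
# illustrative, not taken from any particular paper.

import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Image tokens (queries) attend to text-encoder tokens (keys/values)."""

    def __init__(self, img_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=img_dim, num_heads=n_heads,
            kdim=txt_dim, vdim=txt_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim); txt_tokens: (B, L, txt_dim)
        attended, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + attended)  # residual + norm

# In Imagen-style training, txt_tokens would come from a frozen pretrained
# encoder (e.g., roughly: t5_encoder(input_ids).last_hidden_state.detach()),
# not from a text transformer optimized jointly with the diffusion model.
block = TextCrossAttention(img_dim=320, txt_dim=768)
img = torch.randn(2, 32 * 32, 320)   # flattened feature map from one U-Net level
txt = torch.randn(2, 77, 768)        # frozen text-encoder outputs
print(block(img, txt).shape)         # torch.Size([2, 1024, 320])
```

The only real difference from GLIDE here is where txt_tokens come from: GLIDE trains its text transformer from scratch alongside the diffusion model, while Imagen plugs in a large pretrained encoder and keeps it frozen.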

Before I saw Imagen, I thought "this unCLIP idea does not make sense theoretically, but apparently it helps in practice." After I saw Imagen, I wasn't even sure anymore that it helped in practice. (unCLIP is superior to GLIDE, which doesn't have the benefit of access to CLIP, but it can be beaten by a more relevant baseline that does have access to CLIP and uses it in the obvious way.)

The Imagen approach, of GLIDE-style conditioning with a pretrained CLIP text encoder, was also used independently in

- Katherine Crowson's v-diffusion (https://github.com/crowsonkb/v-diffusion-pytorch), developed before DALLE2 or even GLIDE existed

- Stable Diffusion, which Crowson also worked on
