Before I studied machine learning, I was an econ grad student banging out OLS problem sets (I derived the OLS estimator, (X'X)^(-1)X'y, so many times that I see it whenever I close my eyes). My research area was antitrust theory, and in particular vertical integration. That gives me a unique perspective on the question of the day: how will the LLM API market evolve as more companies enter the space?
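For anyone who hasn't had the pleasure: the estimator falls out of minimizing the sum of squared residuals. A quick sketch of the standard derivation:

```latex
% OLS: minimize the sum of squared residuals over \beta
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta}\, (y - X\beta)^\top (y - X\beta) \\
0 &= -2\,X^\top (y - X\hat{\beta}) \qquad \text{(first-order condition)} \\
X^\top X\,\hat{\beta} &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y
\end{aligned}
```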
The market began, famously, with OpenAI releasing ChatGPT and rapidly hitting $1.3B in annualized revenue. At this time last year, however, there was basically no competition in the LLM API market. Bard had yet to be released, let alone Claude, and Gemini was a mere twinkle in Sundar's eye. OpenAI had a monopoly, which let it capture basically all of the value in the market.
In the year since, what we've seen is that there doesn't appear to be a moat in LLMs except at the very high end. GPT-4 is the only model without real competition, and even it has competitors sniffing around: Gemini Ultra, Llama 3, and the as-yet-unreleased, mysterious Mistral model bigger than Mistral Medium. At the GPT-3.5 level, however, you have many options for hosting, and you can even host a comparable open model yourself. That necessarily limits the prices any company can charge.
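To make "host it yourself" concrete, here's a minimal sketch of serving an open-weights model locally with Hugging Face transformers (the model name and generation settings are illustrative, and you'll need a GPU with enough memory):

```python
# Minimal self-hosting sketch: no API fees, just your own hardware.
# Model choice and settings are illustrative, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any open-weights model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain vertical integration in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```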
Generally speaking, companies enter a new market when they think they can make a profit above the minimum threshold they require, and the larger the company, the smaller that threshold. If I, an individual, were to start offering a service to finetune LLMs, I would need to charge a fairly high margin at first, as I would have only a small customer base over which to spread my costs. As my company grew, I would have a larger customer base to spread those costs over, and more money to spend on optimizations that let me serve LLMs more cheaply (a back-of-the-envelope sketch of these economics follows the list):
Quantization
Buying your own chips instead of renting them
Distilling models
Building your own chips
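Here's that amortization logic as a back-of-the-envelope calculation. Every number is invented for illustration:

```python
# How scale lowers the price you need to charge per token.
# All figures are made up for illustration.
def breakeven_price_per_million_tokens(fixed_costs_per_month: float,
                                       marginal_cost_per_million: float,
                                       tokens_served_per_month: float) -> float:
    """Price per 1M tokens at which revenue covers fixed and marginal costs."""
    amortized_fixed = fixed_costs_per_month / (tokens_served_per_month / 1e6)
    return amortized_fixed + marginal_cost_per_million

# A one-person shop: $50k/month of fixed costs spread over 1B tokens.
print(breakeven_price_per_million_tokens(50_000, 0.20, 1e9))   # $50.20 per 1M tokens
# A large provider: the same fixed costs spread over 1T tokens.
print(breakeven_price_per_million_tokens(50_000, 0.20, 1e12))  # $0.25 per 1M tokens
```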
With each optimization you make to your own process, you increase your margin. That's great! You make more money per token. Right? Well, not quite. In a vacuum with a spherical cow, you do. But just as you invest in your ability to serve tokens more efficiently, your competitors are all doing the same, eroding your margins. To do a bad Ben Horowitz impersonation: you run this hard just to stay in place.
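A toy simulation of that dynamic, with every parameter invented: each firm cuts its costs by the same fraction every year, competition pins the price just above the lowest cost in the market, and your margin never budges no matter how hard you optimize.

```python
# Toy Red Queen dynamic: everyone gets more efficient, nobody gains margin.
# All parameters are invented for illustration.
costs = {"you": 1.00, "rival_a": 1.05, "rival_b": 1.10}  # $ per 1M tokens

for year in range(1, 6):
    # Every firm invests in efficiency and cuts its cost 30% per year.
    costs = {firm: cost * 0.70 for firm, cost in costs.items()}
    # Competition drives the price to just above the lowest cost in the market.
    price = min(costs.values()) * 1.10
    margin = (price - costs["you"]) / price
    print(f"year {year}: your cost ${costs['you']:.3f}, price ${price:.3f}, margin {margin:.0%}")
```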
The necessary implication is that the undifferentiated LLM market will become a ruthless competition for efficiency, with companies competing to see who can accept the lowest return on invested capital.
In the classic business strategy book The Innovator's Dilemma lives the canonical example of how technological disruption happens (this is taken from the New Yorker profile of the author, Clayton Christensen):
In the world of steel manufacturing, steel was historically made in massive integrated mills, which produced high-quality steel at reasonable margins. Then along came electric mini mills, which could make the lowest-quality steel at a cheaper cost. The large steel manufacturers saw this, shrugged, and focused on making high-quality steel at (relatively) high margins. Over time, the mini mill operators figured out how to make higher and higher quality steel, moved upmarket, and killed the massive integrated mills (US Steel, once the 16th largest corporation by market cap in the US, was removed from the S&P 500 in 2014).
The analogy to LLMs is straightforward. The large labs focus on making the highest-performing models. They are expensive, but excellent, and they outperform everything else. You need margin to pay for all of those $900k engineers! Even then, however, we see competition on price: Gemini Pro, for instance, is priced well below GPT-4.
At the low end, we have the open source community, led by Meta and r/LocalLlama, cranking out high-quality models and figuring out how to serve them on ridiculously low-powered machines. We should expect the open-weight models to improve in quality and decrease in cost (on a quality-adjusted basis), putting pressure on the margins of the largest labs. As a real-time example, Together came out with a hosted version of Mixtral that is 70% cheaper than Mistral's own version.
We should thus expect a bifurcated market: more expensive, higher-quality models at the high end, and cheaper, lower-quality models at the low end. For open-weight models, we should expect prices to converge to the cost of GPUs + electricity (and, as competition increases in the GPU market, perhaps just to the cost of electricity).
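What does "GPUs + electricity" work out to? A rough sketch, where every number is an assumption rather than a measurement:

```python
# Rough floor on open-weights serving cost: hardware rental plus power.
# Every figure here is an assumption, not a measurement.
gpu_cost_per_hour = 2.00     # renting one 80GB GPU, $/hour
power_kw = 0.7               # draw under load, kilowatts
electricity_per_kwh = 0.10   # $/kWh
tokens_per_second = 1_500    # batched throughput for a ~7B model

tokens_per_hour = tokens_per_second * 3600
cost_per_hour = gpu_cost_per_hour + power_kw * electricity_per_kwh
print(f"${cost_per_hour / (tokens_per_hour / 1e6):.3f} per 1M tokens")   # ~$0.38

# If competition squeezes the GPU rental margin to zero, the floor is power alone:
print(f"${(power_kw * electricity_per_kwh) / (tokens_per_hour / 1e6):.4f} per 1M tokens")
```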
The question, then, is what the buyer for these APIs looks like. If we ranked the economically valuable tasks that LLMs can perform from most complex to least complex, how many of them would require high-end capability? At some point there's a threshold above which GPT-4 is required, but it's hard to imagine that threshold remaining static. The open-weight models will continue their inexorable climb up the list, biting at the margins of the large labs. And as tooling makes it effortless to switch between model APIs, developers will switch to the lowest-cost model that accomplishes their task. If you're using an LLM for, say, short code completion, do you need the biggest and best model? Probably not!
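That switching behavior is easy to sketch: a router that picks the cheapest model clearing the capability bar a task requires. The price table and capability scores below are hypothetical:

```python
# Hypothetical router: pick the cheapest model that can handle the task.
# Model names, prices, and capability scores are invented for illustration.
MODELS = [
    # (name, $ per 1M tokens, rough capability score out of 10)
    ("open-7b-finetune", 0.25, 3),
    ("open-mixtral",     0.60, 5),
    ("mid-tier-api",     1.00, 7),
    ("frontier-api",    30.00, 10),
]

def cheapest_capable_model(required_capability: int) -> str:
    """Return the lowest-cost model whose capability clears the bar."""
    candidates = [m for m in MODELS if m[2] >= required_capability]
    return min(candidates, key=lambda m: m[1])[0]

print(cheapest_capable_model(3))   # short code completion -> open-7b-finetune
print(cheapest_capable_model(9))   # hard reasoning -> frontier-api
```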
Moreover, the companies with the biggest success in the consumer marketplace will inevitably balk at paying a significant share of their profits to another company, and will start to train their own models. We already see companies like Harvey and Cursor, which had some of the earliest access to GPT-4, hiring research scientists and engineers, giving them the talent required to train their own foundation models. As API fees are probably the biggest expense for these companies, it seems natural that they will do everything they can to drive those costs down.
If you're building your own models, you can raise a round of funding to pay for it, trading a one-time capital expenditure for higher ongoing margins. This is the justification for Google's TPU program, for example: by spending billions of dollars on custom silicon, Google avoids paying Nvidia's Danegeld.
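The underlying math is a simple payback calculation. A sketch, with every figure invented:

```python
# Payback sketch: one-time capex on your own models vs. ongoing API fees.
# All figures are invented for illustration.
monthly_api_bill = 2_000_000        # what you pay a provider today, $/month
monthly_self_serve_cost = 400_000   # compute + team to run your own model, $/month
one_time_capex = 25_000_000         # training runs, hardware, hiring

monthly_savings = monthly_api_bill - monthly_self_serve_cost
payback_months = one_time_capex / monthly_savings
print(f"capex pays for itself in {payback_months:.0f} months")  # ~16 months
```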
The conclusion, then, is that the market for LLM APIs will converge on the lowest-cost provider, as long as your task is simple enough to be solved by open-weight models. If your task is so complex that it requires the best model, you're stuck paying OpenAI. For everyone else, there's finetuned Mistral 7B.