On Saturday, Meta launched its latest AI large language model – Llama 4. In fact, it launched two models (Llama 4 Scout and Llama 4 Maverick) and announced one more (Llama 4 Behemoth). A new LLM coming along almost doesn’t feel like news these days – so why should we care about this one?
Silicon Valley Responds to the AI Arms Race
If you haven’t been paying attention, there’s an AI arms race going on. Just a few months ago the status quo was disrupted when a little-known Chinese startup, DeepSeek, launched its groundbreaking V3 and R1 models. Saturday’s news marked Silicon Valley’s first response. So keen was Meta to respond that it didn’t even want to wait until Monday to announce the model.
A few months back – although AI months feel more like dog years – the main protagonists were OpenAI, Meta, and Google, with Anthropic and Mistral trailing close behind.
Between them, they’d been launching notable new “frontier-level” LLMs – the most advanced models, with hundreds of billions of parameters – every few months. Each generation improved in capability and was generally priced at around $10–$150 per million tokens.
| Company | Release Date | Model Name | Parameters | Cost per 1M tokens |
|---|---|---|---|---|
| OpenAI | 2022/11 | GPT-3.5 (ChatGPT) | 175B | $3.00 |
| OpenAI | 2023/03 | GPT-4 | 1.8T | $36.00 |
| OpenAI | 2023/11 | GPT-4 Turbo | Unknown | $30.00 |
| Google | 2023/12 | Gemini 1.0 | 1.5T | $150.00 |
| Google | 2024/02 | Gemini 1.5 Pro | 200B+ | $10.00 |
| OpenAI | 2024/05 | GPT-4o | Unknown | $10.00 |
| Meta | 2024/07 | Llama 3.1 405B | 405B | $3.50 |
| DeepSeek | 2024/12 | DeepSeek V3 | 671B | $0.48 |
| OpenAI | 2025/02 | GPT-4.5 Preview | 12.8T | $150.00 |
| DeepSeek | 2025/03 | DeepSeek V3.1 | 685B | $0.48 |
| Google | 2025/03 | Gemini 2.5 Pro | Unknown | $9.00 |
| Meta | 2025/04 | Llama 4 Maverick | 400B | $0.19 |
DeepSeek Changed the Game
Then DeepSeek V3 came along. It was roughly as good as GPT-4o (the state of the art all the way back in December 2024) but 27 times cheaper to run. It upended the economics of LLMs. Some people suggested it led to “panic” at companies like Meta – although I think it more likely led to some serious contemplation and head-scratching about how they could respond.
It is thought that what made DeepSeek V3 so good was its use of distillation training – the process of transferring the capabilities of a large “teacher” model to a smaller “student” model. Allegedly – although never confirmed – DeepSeek V3 was trained on outputs from OpenAI’s models.
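To make the idea concrete, here’s a minimal sketch of a classic distillation loss in PyTorch. This is purely illustrative – we don’t know DeepSeek’s actual training recipe – and the function name, temperature, and blending weight are my own placeholders:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft loss (imitate the teacher's output distribution)
    with a hard loss (match the ground-truth labels)."""
    # Soften both distributions so the student learns the teacher's
    # relative confidence across all tokens, not just its top pick.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable as the
    # temperature changes (the standard trick from Hinton et al., 2015).
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2
    # Hard loss: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The temperature is the interesting part: softening the teacher’s distribution exposes which wrong answers it considers nearly right, and that signal is what the student learns from.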
Distillation – the new (old) training technique
We’ve seen the distillation technique work well before – lots of open-source community models use it to create smaller experts – but DeepSeek’s capabilities have shown Meta, and others, just how good distillation can be.
Meta placed this technique front and centre in its release. In its announcement, the company said: “these models are our best yet thanks to distillation from Llama 4 Behemoth.”
Mixture of Experts – giving distillation the edge
It’s not just DeepSeek’s distillation training that Meta embraced: Llama 4 also, notably, uses the Mixture of Experts (MoE) technique, which combines a number of smaller “expert” networks into one large model, with a router deciding which experts handle each input. One expert might be good at chat, another at coding.
MoE is a technique OpenAI has reportedly used since GPT-4 (rumoured to consist of eight 220-billion-parameter experts), but Meta preferred monolithic models until Llama 4 – it said in the Llama 3.1 announcement that “we opted for a standard decoder-only transformer model architecture with minor adaptations rather than a mixture-of-experts model to maximize training stability.”
And because an MoE model is a combination of smaller experts, it has more to gain from distillation – each expert is, in effect, a smaller student that can absorb a focused slice of the teacher’s capability.
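Here’s a toy sketch of the routing idea in PyTorch – not Llama 4’s actual architecture; the class name, expert count, and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """A toy Mixture-of-Experts layer: a learned router sends each
    token to its top-k expert feed-forward networks."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # one score per expert
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Score the experts, keep only the top-k.
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalise
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token, so total parameters
        # can be huge while per-token compute stays small.
        for i, expert in enumerate(self.experts):
            token_mask = (top_idx == i).any(dim=-1)
            if token_mask.any():
                w = top_w[token_mask][top_idx[token_mask] == i]
                out[token_mask] += w.unsqueeze(-1) * expert(x[token_mask])
        return out
```

This sparsity is the economic trick: an MoE model like Llama 4 Maverick can hold 400B total parameters while activating only a small fraction of them for any given token.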
Better models – why does it matter?
Returning to the original question: why do we care about these new models? The answer comes down to a few key factors:
It’s now widely accepted that to replace real workflows with AI we need to run models multiple times, iterating through a process step by step, rather than trying to do everything in one shot (there’s a sketch of this pattern below).
The problem with previous models is that they were either good quality but relatively slow and expensive, or faster and cheaper but with noticeably worse quality.
This new generation of models – Llama 4 and DeepSeek V3.1 – is fast and cheap to run, yet produces very good quality output. And as a bonus, both are open source, so you’re not tied to a single vendor to run them. You can even run them on your own hardware if your pockets are deep enough.
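To illustrate the multi-call pattern mentioned above, here’s a minimal sketch using the OpenAI-compatible API that most open-model servers expose. The endpoint URL, model name, and prompts are hypothetical placeholders, not real services:

```python
from openai import OpenAI

# Any OpenAI-compatible server hosting an open model will do; this
# base_url and model name are placeholders for illustration only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "llama-4-maverick"

def ask(prompt: str) -> str:
    """One round-trip to the model."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Iterate rather than one-shot: draft, critique, then revise.
draft = ask("Draft a one-paragraph summary of the attached report: ...")
critique = ask(f"List any clarity or accuracy problems in this summary:\n{draft}")
final = ask(f"Rewrite the summary to fix these problems.\n"
            f"Summary:\n{draft}\n\nProblems:\n{critique}")
print(final)
```

At Llama 4 Maverick’s $0.19 per million tokens, chaining three calls like this costs fractions of a cent; at GPT-4.5’s $150, the same loop would be nearly three orders of magnitude more expensive.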
As a result, generative AI is about to get much better – and not only that, we seem to be coalescing around a best practice of combining distillation and MoE to build the best models.